
Lecture 2

The lecture traces the history of attention and transformer models: from basic feedforward and convolutional neural networks, to recurrent models such as LSTMs, to current transformer architectures. Transformers use self-attention to compute representations without RNNs or CNNs, which makes computation easy to parallelize. Recent models such as AlphaFold2 use transformer architectures to train end-to-end models for applications like protein structure prediction.


Jan. 14, 2022

CS 886 Deep Learning for Biotechnology

Ming Li
CONTENT

01. Word2Vec
02. Attention / Transformers
03. Pretraining: GPT / BERT
04. Deep learning applications in proteomics
05. Student presentations begin


LECTURE TWO
Attention and Transformers

02 Attention and Transformers

Plan: we trace the history to see how attention and transformers emerged.

1. Basic models, related to transduction models and attention.
2. Encoder-decoder models, using recurrent networks such as LSTMs.
3. Transformer models are general models sufficient for almost all biotech
applications (graph models may be treated as special cases too).
4. For example, DeepMind's AlphaFold2 depends on a transformer
architecture to train an end-to-end model.
5. The transformer model also makes large-scale (pre)training on biological
data easy.
02 Attention and Transformers

1. Fully connected network, feedforward network

Goal: learn the weights on the edges.
02 Attention and Transformers

2. CNN

A CNN is a neural network with some convolutional layers (and some other
layers). A convolutional layer has a number of filters, each of which performs
a convolution operation.
02 Attention and Transformers

Convolutional layer: these are the network parameters to be learned.
Each filter detects a small (3 x 3) pattern.

Input (6 x 6):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 x 3):
-1  1 -1
-1  1 -1
-1  1 -1
Convolution operation (Filter 1, stride = 1):
Slide the filter over the input and take the dot product with each 3 x 3 patch.
For the top-left patch the dot product is 3; shifting one column to the right gives -1.
02 Attention and Transformers

Convolution (Filter 1, stride = 1) over the full 6 x 6 input gives the 4 x 4 feature map:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
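As a minimal sketch (not from the slides), the feature map above can be reproduced in NumPy, assuming stride 1 and no padding:

```python
import numpy as np

# 6x6 binary input from the slide
x = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

# Filter 1 from the slide: detects a diagonal pattern
f1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def conv2d(image, kernel, stride=1):
    """Valid convolution (no padding): dot product of the kernel with each patch."""
    h = (image.shape[0] - kernel.shape[0]) // stride + 1
    w = (image.shape[1] - kernel.shape[1]) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i*stride:i*stride+kernel.shape[0],
                          j*stride:j*stride+kernel.shape[1]]
            out[i, j] = np.sum(patch * kernel)
    return out

print(conv2d(x, f1))
# [[ 3. -1. -3. -1.]
#  [-3.  1.  0. -3.]
#  [-3. -3.  0.  1.]
#  [ 3. -2. -2. -1.]]
```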
02 Attention and Transformers

3. RNN
Parameters to be learned:
U, V, W
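The slide's figure is not reproduced here; as a hedged sketch, assuming the standard Elman-style RNN in which U maps the input to the hidden state, W maps the previous hidden state to the new one, and V maps the hidden state to the output:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b_h, b_y):
    """One step of a simple (Elman) RNN.
    U: input-to-hidden, W: hidden-to-hidden, V: hidden-to-output."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b_h)   # new hidden state
    y_t = V @ h_t + b_y                         # output (e.g. logits)
    return h_t, y_t

# Tiny example: input dim 3, hidden dim 4, output dim 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b_h, b_y = np.zeros(4), np.zeros(2)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h, y = rnn_step(x_t, h, U, W, V, b_h, b_y)
```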
02 Attention and Transformers

Simple RNN vs LSTM


02 Attention and Transformers

Encoder-Decoder machine translation

LSTM
02 Attention and Transformers

Encoder-Decoder LSTM structure for chatting


02 Attention and Transformers

Attention
02 Attention and Transformers
Image caption generation using attention

A CNN produces a vector for each image region. An initial vector z0 (which is
also learned) is matched against each region vector to produce an attention
score (e.g., 0.7 for one region).
02 Attention and Transformers
Image caption generation using attention

The attention weights over the regions (e.g., 0.7, 0.1, 0.1, 0.1, 0.0, 0.0)
are used to form a weighted sum of the region vectors. This weighted sum,
together with z0, produces Word 1 and the next state z1.
02 Attention and Transformers
Image caption generation using attention

At the next step the attention weights shift (e.g., 0.0, 0.8, 0.2, 0.0, 0.0,
0.0), a new weighted sum is formed, and z1 produces Word 2 and the next state z2.
02 Attention and Transformers

Image caption generation using attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio,
“Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
02 Attention and Transformers

More new ideas:

1. ULMFiT: pre-training and transfer learning in NLP.
2. Recurrent models require sequential computation that is linear in the
sequence length and hard to parallelize. ELMo uses a bidirectional LSTM.
3. To reduce this sequential computation, several CNN-based models were
introduced, such as ConvS2S and ByteNet. Relating two distant positions
requires linear depth in ConvS2S and logarithmic depth in ByteNet.
4. The transformer is the first transduction model relying entirely on
self-attention to compute the representations of its input and output,
without using RNNs or CNNs.
02 Attention and Transformers

Transformer
02 Attention and Transformers

An Encoder Block: same structure, different parameters


02 Attention and Transformers

Encoder

Note: the feed-forward network (FFNN) is applied to each word independently,
hence it can be parallelized.
02 Attention and Transformers

Self Attention

First we create three vectors by multiplying each input embedding xi (1 x 512)
with three matrices WQ, WK, WV (each 512 x 64):

qi = xi WQ
ki = xi WK
vi = xi WV
02 Attention and Transformers

Self Attention

Now we need to calculate a score to determine how much focus to place on the
other parts of the input.
02 Attention and Transformers

Self Attention

Formula (scaled dot-product attention):

Attention(Q, K, V) = softmax(Q K^T / sqrt(dk)) V

where dk = 64 is the dimension of the key vectors. For a two-word example the
output might be z1 = 0.88 v1 + 0.12 v2.
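A minimal NumPy sketch of this formula (illustrative only; the projection matrices are random stand-ins, with d_model = 512 and dk = 64 as on the slides):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model); W_*: (d_model, d_k)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how much each position attends to the others
    weights = softmax(scores, axis=-1)       # e.g. [0.88, 0.12] for the first position
    return weights @ V                       # z_i = sum_j weights[i, j] * v_j

# Two input embeddings of size 512, projected down to d_k = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)   # shape (2, 64)
```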
02 Attention and Transformers

Multiple heads

1. It expands the model's ability to focus on different positions.
2. It gives the attention layer multiple "representation subspaces".
02 Attention and Transformers

The following layer is expecting a single matrix (one vector per word, e.g.,
2 x 4 in the running example). Hence we concatenate the outputs of all heads
and multiply them by an additional weight matrix WO.
If you want some more intuition on attention: watch https://www.youtube.com/watch?v=-9vVhYEXeyQ
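A small sketch of the concatenate-and-project step, assuming 8 heads and an output matrix WO (shapes are illustrative only):

```python
import numpy as np

# Suppose 8 heads each produced a (seq_len, d_k) output Z_h
# (see the self-attention sketch above).
seq_len, d_k, n_heads, d_model = 2, 64, 8, 512
rng = np.random.default_rng(0)
head_outputs = [rng.normal(size=(seq_len, d_k)) for _ in range(n_heads)]

# Concatenate along the feature axis and project back to d_model with W_O,
# so the following layer again sees one (seq_len, d_model) matrix.
W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.05
Z = np.concatenate(head_outputs, axis=-1) @ W_O   # shape (2, 512)
```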


02 Attention and Transformers

Representing the input order (positional encoding)

The transformer adds a vector to each input embedding. These vectors follow a
specific pattern that the model learns, which helps it determine the position
of each word, or the distance between different words in the sequence. The
intuition here is that adding these values to the embeddings provides
meaningful distances between the embedding vectors once they're projected into
Q/K/V vectors and during dot-product attention.

More on positional encoding:
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
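One common choice is the fixed sinusoidal encoding from Vaswani et al. (2017); a short sketch is below (learned positional embeddings are an alternative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017).
    Each position gets a d_model-dimensional vector that is added to its embedding."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 512))
inputs = embeddings + positional_encoding(10, 512)     # what the encoder actually consumes
```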
02 Attention and Transformers

Add and Normalize

To stabilize the computation, each sublayer is followed by a residual ("add")
connection and a layer-normalization step, so that each position's feature
vector has the same mean and standard deviation.
02 Attention and Transformers

Layer Normalization (Hinton)

Layer normalization normalizes the inputs across the features.
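A minimal sketch of layer normalization over the feature dimension (gamma and beta are the learnable scale and shift):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer normalization: normalize each row (one token's features) to zero mean
    and unit variance, then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(2, 512))   # two token representations
y = layer_norm(x)   # each row now has (approximately) mean 0 and variance 1
```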
02 Attention and Transformers

The complete transformer

The encoder-decoder attention is just like self-attention, except it takes K
and V from the top of the encoder output and uses its own Q.
02 Attention and Transformers

Note: in the decoder, the input is "incomplete" when calculating
self-attention, because future positions have not been generated yet.

The solution is to set the attention scores for future positions to "-inf"
(before the softmax), so they receive zero weight.
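A small sketch of this masking step, applied to the raw attention scores before the softmax:

```python
import numpy as np

def causal_mask(scores):
    """Mask future positions in decoder self-attention: entries above the diagonal
    (position j > i) are set to -inf so the softmax gives them zero weight."""
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(mask, -np.inf, scores)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw Q K^T / sqrt(d_k) scores
masked = causal_mask(scores)   # row i only attends to positions 0..i
```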


02 Attention and Transformers

Decoder's output linear layer

02 Attention and Transformers

How it works

Decoder's output linear layer. But what about self-attention?
02 Attention and Transformers

Training and the Loss Function

We can use cross entropy as the loss.

We can also optimize two words at a time using beam search: keep a few
alternative candidates for the first word.
02 Attention and Transformers

Cross Entropy and KL (Kullback-Leibler) divergence

• Entropy: E(P) = - Σi P(i) log P(i), the expected (and optimal) prefix-free code length.
• Cross entropy: C(P, Q) = - Σi P(i) log Q(i), the expected coding length when using the optimal code for Q.
• KL divergence: DKL(P || Q) = Σi P(i) log [P(i)/Q(i)] = Σi P(i) [log P(i) - log Q(i)], the extra bits needed to code using Q rather than P.
• JSD(P || Q) = ½ DKL(P || M) + ½ DKL(Q || M), with M = ½ (P + Q); a symmetrized KL divergence.

* JSD = Jensen-Shannon Divergence
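A quick numeric check of these definitions (toy distributions, base-2 logarithms):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution
q = np.array([0.5, 0.3, 0.2])    # model distribution

entropy = -np.sum(p * np.log2(p))          # E(P)
cross_entropy = -np.sum(p * np.log2(q))    # C(P, Q)
kl = np.sum(p * np.log2(p / q))            # D_KL(P || Q)

m = 0.5 * (p + q)
jsd = 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

print(entropy, cross_entropy, kl, jsd)
assert np.isclose(cross_entropy, entropy + kl)   # cross entropy = entropy + KL divergence
```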
02 Attention and Transformers

Transformer Results
02 Attention and Transformers

Next Lecture: Pretraining

BERT, GPT
02 Attention and Transformers

Literature & Resources for Transformers

Vaswani et al., "Attention Is All You Need", 2017.

Resources:
https://nlp.seas.harvard.edu/2018/04/03/attention.html (an excellent
explanation of the transformer model, with code)
Jay Alammar, The Illustrated Transformer (from which I borrowed many
pictures):
http://jalammar.github.io/illustrated-transformer/
