
Transformer

Transformers in Machine
Learning
• Transformer is a neural network architecture used for machine learning tasks, particularly in natural language processing (NLP) and computer vision (CV).
• In 2017, Vaswani et al. published the paper "Attention Is All You Need", which introduced the Transformer architecture. This article explores the architecture, workings, and applications of Transformers.
• The Transformer architecture uses self-attention to process a whole sentence at once rather than word by word. This is useful where older models work step by step, and it helps overcome the challenges seen in models like RNNs and LSTMs.
• NB: RNNs suffer from the vanishing gradient problem, which leads to long-term memory loss. RNNs also process text sequentially, meaning they analyze words one at a time.
Need for the Transformer Model in Machine Learning
For example, in the sentence “XYZ went to France in 2019 when there were no cases of COVID and there he met the president of that country”, the phrase “that country” refers to “France”.

However, an RNN would struggle to link “that country” back to “France”, since it processes each word in sequence and loses context over long sentences. This limitation prevents RNNs from understanding the full meaning of the sentence.

While the memory cells added in LSTMs (Long Short-Term Memory networks) helped address the vanishing gradient issue, LSTMs still process words one by one. This sequential processing means they cannot analyze an entire sentence at once.
For instance, the word “point” has different meanings in these two sentences:
• “The needle has a sharp point.” (point = tip)
• “It is not polite to point at people.” (point = gesture)

Traditional models struggle with this context dependence, whereas the Transformer model, through its self-attention mechanism, processes the entire sentence in parallel, addressing these issues and making it significantly more effective at understanding context.
Architecture
• Let’s consider an example of machine translation. Imagine that we’re translating a French
sentence (‘Je suis étudiant’) into English (‘I am a student’). Let’s initiate our exploration by
considering the model as a black box.

Opening up the black box, we can see there is an encoding part, a decoding part, and connections linking them together.
• According to the original research paper (Attention Is All You Need), both the encoder
and decoder are composed of a stack of six identical layers (Figure 3). However, this is
a hyperparameter, and one can experiment with other arrangements.
The output of the final encoder in the stack is passed to the decoders to guide the generation of the output sequence.
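As a point of reference, here is a minimal sketch of this stack using PyTorch's built-in nn.Transformer class (PyTorch is an assumption; the slides do not name a framework). The hyperparameter values match the base model of the original paper: six encoder layers, six decoder layers, a model dimension of 512, and eight attention heads.

import torch
import torch.nn as nn

# A minimal sketch: the base configuration from "Attention Is All You Need",
# expressed with PyTorch's built-in encoder-decoder Transformer.
model = nn.Transformer(
    d_model=512,           # embedding / model dimension
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # encoder stack depth
    num_decoder_layers=6,  # decoder stack depth
    dim_feedforward=2048,  # inner size of the feed-forward network
)

# Dummy source and target sequences with shape (sequence length, batch, d_model).
src = torch.rand(10, 2, 512)
tgt = torch.rand(7, 2, 512)
out = model(src, tgt)      # output shape: (7, 2, 512)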
Transformer components
• Encoder
• The encoder’s job is to process the input
sequence and create a representation
that the decoder can use to generate
the output sequence.
• First, the input sequence goes through
Input Embedding and Position Encoding,
which generates an encoded version of
each word, capturing both its meaning
and position in the sequence.
• All encoders have the same structure
and are made up of two main parts: the
Self-Attention layer and the Feed-
Forward Neural Network.
Encoder
• Initially, the encoder processes its inputs using a self-attention layer, which helps it understand the relationships between the words in a sentence as it encodes each word. After this, the output of the self-attention layer is sent through a feed-forward neural network, which is applied independently to each position in the sequence.
• To help the model train better, the researchers also added residual connections: each of the two sub-layers (self-attention and feed-forward) is wrapped with a residual connection followed by layer normalization.
Process of Encoding
• Embedding
• First, we tokenize our input text. Tokenizing is the process of breaking down text into smaller units called tokens. Tokens can be words, subwords, characters, or even sentences. The main goal of tokenization is to convert a piece of text into manageable chunks that can be further processed.
• Then, as in most NLP applications, we convert each token into a vector using an embedding algorithm. Embedding is the process of converting tokens into dense vector representations. These vectors capture the semantic meaning of and relationships between tokens. The main goal of embedding is to transform tokens into a numerical format that retains semantic information and can be used as input for machine learning models.
• This embedding process takes place
exclusively in the bottom-most
encoder. All encoders share a
common feature: they receive a list of
vectors, each of size 512. For the
bottom encoder, these vectors are
word embeddings, whereas for the
other encoders, they are the outputs
from the encoder immediately below.

By padding or truncating all input sequences to the same length, we ensure that the output embeddings maintain a consistent size, which is essential for batch processing and model training. The length of this list (the output dimension) is a hyperparameter that we can set, corresponding to the longest sentence in our training dataset.
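As a minimal sketch of the tokenization and embedding step, assuming PyTorch, a toy whitespace tokenizer, and a hypothetical four-word vocabulary (real systems use learned subword vocabularies and much larger embedding tables):

import torch
import torch.nn as nn

# Hypothetical toy vocabulary; real models learn subword vocabularies (e.g. BPE).
vocab = {"<pad>": 0, "je": 1, "suis": 2, "étudiant": 3}

d_model = 512
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# Tokenize by whitespace (an illustrative simplification) and map tokens to ids.
tokens = "je suis étudiant".split()
ids = torch.tensor([[vocab[t] for t in tokens]])  # shape (1, 3): one sentence, three tokens

# Each token id is looked up as a dense 512-dimensional vector.
x = embedding(ids)                                # shape (1, 3, 512)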
Positional Encoding
• Another key aspect is that the model processes each input token in parallel. By incorporating Positional Encoding, we retain information about word order, ensuring that each word’s position in the sentence is taken into account. Since matrices must match in size to be summed, the positional encoding dimensions are identical to those of the input embeddings. These positional encodings are added to the input embeddings at the base of the encoder.
Positional Encoding
[Figure: the Positional Encoding layer in Transformers, with an example.]
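The figure above presumably illustrates the sinusoidal scheme from the original paper; the NumPy sketch below implements that formula as one plausible reading (sine on even dimensions, cosine on odd dimensions).

import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Same dimensions as the input embeddings, so the two matrices can simply be added.
pe = positional_encoding(max_len=50, d_model=512)  # shape (50, 512)
# x = x + pe[:sequence_length]   # added to the embeddings at the base of the encoder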
Self-Attention
• We know that each word in the input sequence is first converted into a
vector using embeddings as described above. As the next step, we create
three vectors: a Query vector (Q), a Key vector (K), and a Value vector (V)
for each word in the sequence. These vectors are obtained by multiplying
the input vector by three different weight matrices that are learned during
training.
• To assess the significance of each word in the sequence relative to the
others, we calculate the dot product of the Query vector of a word with the
Key vectors of all words in the sequence. This generates a set of scores.
These scores indicate the amount of attention we should give to other parts
of the input sentence while encoding a word at a specific position. (Figure 8)
Self-Attention
• After calculating the score, it is divided by the square root of the dimension
of the key vector (√d_k), which helps achieve more stable gradients. In the
original paper, with d_k set to 64, the score is divided by 8. These scores
are then passed through a softmax operation, which normalizes them so
they are all positive and sum to 1. The softmax function transforms the
scores into a probability distribution, highlighting the most relevant tokens
and reducing the influence of less relevant ones.
• The next step involves multiplying each value vector (V) by its softmax score before summing. This ensures that the values of the important words are preserved while the influence of irrelevant words is minimized, since they are multiplied by very small numbers, like 0.001.
• The final step is to sum the weighted value vectors, resulting in the output of the self-attention layer (Z) at that position. The resulting vector is ready to be sent to the feed-forward neural network.
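In symbols, writing q_i, k_j, v_j for the Query, Key, and Value vectors and d_k for the key dimension, the steps above can be summarized as (standard notation, consistent with the original paper):

\mathrm{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad
\alpha_{ij} = \mathrm{softmax}_j(\mathrm{score}_{ij}), \qquad
z_i = \sum_j \alpha_{ij}\, v_j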
Matrix Calculation of Self-Attention

• Transformers perform these calculations using matrix operations. This approach is much faster and more efficient, especially for processing large amounts of data.
• By using matrices, the model can handle multiple words and their relationships simultaneously, significantly speeding up the computation.
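A minimal NumPy sketch of this matrix form for a single attention head; the weight matrices W_Q, W_K, W_V are random stand-ins for the learned parameters, and the shapes follow the paper (d_model = 512, d_k = 64).

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product self-attention over every position at once.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each of shape (seq_len, d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)       # each row is a probability distribution
    return weights @ V                       # (seq_len, d_k): the Z matrix

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 512, 64
X = rng.normal(size=(seq_len, d_model))      # embedded (and position-encoded) inputs
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)         # shape (6, 64)

Computing all positions together like this is exactly why the approach is fast: a single matrix multiplication replaces a loop over words.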
Multi-headed attention
• Multi-headed attention extends this self-attention concept by running
multiple attention mechanisms (heads) in parallel, allowing the model to focus
on different parts of the input sequence simultaneously.
• Suppose we have an input sentence: “The cat sat on the mat.” In a single-
headed attention mechanism, the model might focus on the relationship
between “cat” and “mat” when predicting the next word. In a multi-headed
attention mechanism, one head might focus on the relationship between
“cat” and “mat,” while another head focuses on the relationship between
“cat” and “sat,” and so on. This multi-faceted approach helps the model
capture more nuances and context.
Multi-headed attention
• Multi-headed attention introduces multiple sets of Query/Key/Value weight
matrices, rather than just one. In the case of the Transformer model, there
are 8 attention heads, resulting in 8 sets of these matrices for each encoder.
Each set is initialized randomly and, after training, projects the input
embeddings or vectors from lower encoders into different representation
subspaces. This diversity in representation helps the model capture a richer
and more detailed understanding of the input.
Multi-headed attention
• This presents a challenge because the feed-forward layer expects a single matrix (a vector for each word), not eight separate matrices. Therefore, we need a method to combine these eight matrices into one.
• To achieve this, we concatenate the matrices and then multiply them by an additional weight matrix, W0.
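Continuing the NumPy sketch above (reusing self_attention, X, and rng), here is a sketch of how the eight heads are combined; W_O below plays the role of the additional weight matrix W0 and is again a random stand-in for a learned parameter.

num_heads, d_model, d_k = 8, 512, 64          # 8 heads of size 64, as in the base model

# One (W_Q, W_K, W_V) set per head, plus the output projection W_O (a.k.a. W0).
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
W_O = rng.normal(size=(num_heads * d_k, d_model))

# Run the 8 heads in parallel, concatenate their outputs, and project back to d_model.
Z_per_head = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
Z = np.concatenate(Z_per_head, axis=-1) @ W_O  # shape (seq_len, 512), one vector per word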
Benefits of Multi-Headed
Attention
• Enhanced Representation: By having multiple heads, the model can
capture different aspects of the input sequence, improving its ability to
understand complex dependencies.
• Parallel Processing: Multi-headed attention allows the model to process
different parts of the input sequence simultaneously, making it more
efficient.
• Rich Feature Extraction: Each head can learn to attend to different
features, providing a richer representation of the input sequence.
Feed-forward network

After the self-attention operation, the output is passed through a feed-forward network (FFN). This FFN is a simple two-layer fully connected network that is applied independently to each position in the sequence. It introduces non-linearity and helps the model capture complex patterns in the data.
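A minimal PyTorch sketch of this position-wise feed-forward network, assuming the sizes from the original paper (inner dimension 2048) and a ReLU non-linearity:

import torch.nn as nn

# Two fully connected layers with a non-linearity in between, applied to each
# position in the sequence independently.
ffn = nn.Sequential(
    nn.Linear(512, 2048),  # expand: d_model -> d_ff
    nn.ReLU(),             # non-linearity
    nn.Linear(2048, 512),  # project back: d_ff -> d_model
)
# y = ffn(x)   # x of shape (batch, seq_len, 512) -> y of the same shape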
Residual Connections and Layer Normalization

• Both the self-attention mechanism and the feed-forward network are followed by residual connections and layer normalization. The residual connections help preserve the information from the original input by adding it to the output of the sub-layer, ensuring the model retains important features. Layer normalization is then applied to stabilize and accelerate training by normalizing the output of the previous step.
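A sketch of this wrapping in PyTorch, following the post-norm arrangement of the original paper, i.e. LayerNorm(x + Sublayer(x)); self_attn and ffn below are placeholder callables standing in for the two sub-layers.

import torch.nn as nn

norm1 = nn.LayerNorm(512)
norm2 = nn.LayerNorm(512)

def encoder_layer(x, self_attn, ffn):
    # Residual connection around self-attention, followed by layer normalization.
    x = norm1(x + self_attn(x))
    # Residual connection around the feed-forward network, followed by normalization.
    x = norm2(x + ffn(x))
    return x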
Decoder

• The primary function of the decoder is to take the encoded input and
generate the output tokens one at a time, in an iterative process.
• The decoding process begins with the first step, where the decoder receives the encoder’s output along with a special start token (like <start>). Since no output tokens have been generated yet, this first step relies solely on the input from the encoder.
Decoder
• As the model moves through later steps, it generates the next token by
using both the encoder’s output and the sequence of previously generated
tokens. The decoder uses a masked self-attention mechanism, ensuring
that each token can only attend to tokens that came before it, preserving
the sequential nature of the output.
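A small NumPy sketch of the causal mask behind masked self-attention: positions above the diagonal (future tokens) are set to minus infinity before the softmax, so their attention weights become zero and each token can only attend to itself and to earlier tokens.

import numpy as np

seq_len = 5
# Boolean mask that is True strictly above the diagonal, i.e. at "future" positions.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Attention scores as in the self-attention sketch above (zeros here just for illustration).
scores = np.zeros((seq_len, seq_len))
scores[future] = -np.inf   # -inf scores turn into 0 weights after the softmax,
                           # hiding future tokens from the decoder.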
Decoder

• Instead of attending to future tokens, the decoder focuses on previously generated tokens, computing a weighted sum of these tokens to emphasize the most relevant ones. This allows the model to maintain context, producing a coherent and meaningful sequence at each step.
• While the self-attention layer helps the decoder focus on the previously
generated tokens in the output sequence, the decoder also incorporates
another important mechanism: the encoder-decoder attention sub-layer.
This sub-layer enables the decoder to attend to the relevant parts of the
input sequence encoded by the encoder.
Decoder
• The key idea here is cross-attention, where the decoder generates queries
(Q) from its previous layer’s output, while the keys (K) and values (V) are
derived from the encoder’s output. This allows the decoder to retrieve the
most relevant information from the encoded input, helping it generate a
contextually accurate output sequence.

• Like self-attention, encoder-decoder attention uses multi-headed attention, which enables the model to focus on different aspects of the input sequence in parallel. This mechanism ensures that the decoder can leverage the information encoded from the input effectively, generating coherent and contextually accurate outputs.
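Continuing the earlier NumPy sketches (reusing rng and the softmax helper, with illustrative shapes), here is a sketch of cross-attention: the only change from self-attention is where Q, K, and V come from. The weight matrices are random stand-ins for the decoder's learned cross-attention parameters.

import numpy as np

dec_x = rng.normal(size=(7, 512))     # decoder states: 7 target positions so far
enc_out = rng.normal(size=(6, 512))   # final encoder output: 6 source positions

Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Q = dec_x @ Wq                        # queries come from the decoder        -> (7, 64)
K = enc_out @ Wk                      # keys come from the encoder output    -> (6, 64)
V = enc_out @ Wv                      # values come from the encoder output  -> (6, 64)

scores = Q @ K.T / np.sqrt(64)        # (7, 6): each target position scores every source position
Z = softmax(scores, axis=-1) @ V      # (7, 64): encoder information pulled into the decoder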
Final Linear Layer and the
Softmax Layer
• Once the decoder stack produces an output vector, this needs to be
converted into a word or token. The final Linear layer projects this output
into a much larger vector, known as the logits vector, with each element
corresponding to a possible word from the model’s vocabulary. For
instance, if the model knows 30,000 words, the logits vector will have
30,000 values, each representing a score for a specific word. The Softmax
layer then transforms these scores into a probability distribution, where
the word with the highest probability is chosen as the next token in the
sequence.
Final Linear Layer and the
Softmax Layer

Through this process, the model generates text in a step-by-step manner, ensuring each word is
contextually relevant based on the input and previously generated tokens.
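A minimal PyTorch sketch of this last step, assuming a hypothetical vocabulary of 30,000 words and greedy decoding (taking the single most probable token):

import torch
import torch.nn as nn

vocab_size, d_model = 30_000, 512
to_logits = nn.Linear(d_model, vocab_size)  # the final Linear layer

decoder_output = torch.rand(1, d_model)     # output vector from the decoder stack for one position
logits = to_logits(decoder_output)          # (1, 30000): one score (logit) per vocabulary word
probs = torch.softmax(logits, dim=-1)       # Softmax layer: scores -> probability distribution
next_token = probs.argmax(dim=-1)           # greedy choice: id of the most probable next word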
