
Transformer

Transformers in Machine
Learning
• Transformer is a neural network architecture used for machine learning tasks, particularly in natural language processing (NLP) and computer vision (CV).
• In 2017, Vaswani et al. published the paper "Attention Is All You Need", which introduced the Transformer architecture. This article explores the architecture, workings, and applications of Transformers.
• The Transformer architecture uses self-attention to process a whole sentence at once rather than word by word. This is useful where older models work step by step, and it helps overcome the challenges seen in models like RNNs and LSTMs.
• NB: RNNs suffer from the vanishing gradient problem, which leads to long-term memory loss. RNNs also process text sequentially, meaning they analyze words one at a time.
Need for the Transformer Model in Machine Learning
For example, in the sentence “XYZ went to France in 2019 when there were no cases of COVID and there he met the president of that country”, the phrase “that country” refers to “France”.

However, an RNN would struggle to link “that country” back to “France”, since it processes each word in sequence and loses context over long sentences. This limitation prevents RNNs from understanding the full meaning of the sentence.

While the memory cells added in LSTMs (Long Short-Term Memory networks) helped address the vanishing gradient issue, LSTMs still process words one by one. This sequential processing means they cannot analyze an entire sentence at once.
For instance, the word “point” has different meanings in these two sentences:
• “The needle has a sharp point.” (point = tip)
• “It is not polite to point at people.” (point = gesture)

Traditional models struggle with this context dependence, whereas the Transformer model, through its self-attention mechanism, processes the entire sentence in parallel, addressing these issues and making it significantly more effective at understanding context.
Architecture
• Let’s consider an example of machine translation. Imagine that we’re translating a French
sentence (‘Je suis étudiant’) into English (‘I am a student’). Let’s initiate our exploration by
considering the model as a black box.

Opening up the black box, we can see there is an encoding part, a decoding part, and connections linking them together.
• According to the original research paper (Attention Is All You Need), both the encoder
and decoder are composed of a stack of six identical layers (Figure 3). However, this is
a hyperparameter, and one can experiment with other arrangements.
The output of the final encoder in the stack is passed to the decoders to guide the generation of the output sequence.
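As a point of reference, here is a minimal sketch of this stack using PyTorch's built-in nn.Transformer class (PyTorch is an assumption; the slides do not name a framework). The hyperparameter values match the base model of the original paper: six encoder layers, six decoder layers, a model dimension of 512, and eight attention heads.

import torch
import torch.nn as nn

# A minimal sketch: the base configuration from "Attention Is All You Need",
# expressed with PyTorch's built-in encoder-decoder Transformer.
model = nn.Transformer(
    d_model=512,           # embedding / model dimension
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # encoder stack depth
    num_decoder_layers=6,  # decoder stack depth
    dim_feedforward=2048,  # inner size of the feed-forward network
)

# Dummy source and target sequences with shape (sequence length, batch, d_model).
src = torch.rand(10, 2, 512)
tgt = torch.rand(7, 2, 512)
out = model(src, tgt)      # output shape: (7, 2, 512)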
Transformer components
• Encoder
• The encoder’s job is to process the input
sequence and create a representation
that the decoder can use to generate
the output sequence.
• First, the input sequence goes through
Input Embedding and Position Encoding,
which generates an encoded version of
each word, capturing both its meaning
and position in the sequence.
• All encoders have the same structure
and are made up of two main parts: the
Self-Attention layer and the Feed-
Forward Neural Network.
Encoder
• Initially, the encoder processes its inputs using a self-attention layer, which helps it understand the relationships between the words in a sentence as it encodes each word. After this, the output of the self-attention layer is sent through a feed-forward neural network, which is applied independently to each position in the sequence.
• To help the model train better, the researchers also added residual connections: each of the two sub-layers (self-attention and feed-forward) is wrapped with a residual connection followed by layer normalization.
Process of Encoding
• Embedding
• First, we tokenize our input text. Tokenizing is the process of breaking down text into smaller units called tokens. Tokens can be words, subwords, characters, or even sentences. The main goal of tokenization is to convert a piece of text into manageable chunks that can be further processed.
• Then, as in most NLP applications, we convert each token into a vector using an embedding algorithm. Embedding is the process of converting tokens into dense vector representations. These vectors capture the semantic meaning of and relationships between tokens. The main goal of embedding is to transform tokens into a numerical format that retains semantic information and can be used as input for machine learning models.
• This embedding process takes place
exclusively in the bottom-most
encoder. All encoders share a
common feature: they receive a list of
vectors, each of size 512. For the
bottom encoder, these vectors are
word embeddings, whereas for the
other encoders, they are the outputs
from the encoder immediately below.

By padding or truncating all input sequences to the same length, we ensure that the output embeddings maintain a consistent size, which is essential for batch processing and model training. The length of this list (the output dimension) is a hyperparameter that we can set, corresponding to the longest sentence in our training dataset.
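As a minimal sketch of the tokenization and embedding step, assuming PyTorch, a toy whitespace tokenizer, and a hypothetical four-word vocabulary (real systems use learned subword vocabularies and much larger embedding tables):

import torch
import torch.nn as nn

# Hypothetical toy vocabulary; real models learn subword vocabularies (e.g. BPE).
vocab = {"<pad>": 0, "je": 1, "suis": 2, "étudiant": 3}

d_model = 512
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# Tokenize by whitespace (an illustrative simplification) and map tokens to ids.
tokens = "je suis étudiant".split()
ids = torch.tensor([[vocab[t] for t in tokens]])  # shape (1, 3): one sentence, three tokens

# Each token id is looked up as a dense 512-dimensional vector.
x = embedding(ids)                                # shape (1, 3, 512)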
Positional Encoding
• Another key aspect is that the model processes each input token in parallel. By incorporating Positional Encoding, we retain information about word order, ensuring that each word’s position in the sentence is taken into account. Since matrices must match in size to be summed, the positional encoding dimensions are identical to those of the input embeddings. These positional encodings are added to the input embeddings at the base of the encoder.
Positional Encoding
[Figure: the Positional Encoding layer in Transformers, with an example.]
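The figure above presumably illustrates the sinusoidal scheme from the original paper; the NumPy sketch below implements that formula as one plausible reading (sine on even dimensions, cosine on odd dimensions).

import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Same dimensions as the input embeddings, so the two matrices can simply be added.
pe = positional_encoding(max_len=50, d_model=512)  # shape (50, 512)
# x = x + pe[:sequence_length]   # added to the embeddings at the base of the encoder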
Self-Attention
• We know that each word in the input sequence is first converted into a
vector using embeddings as described above. As the next step, we create
three vectors: a Query vector (Q), a Key vector (K), and a Value vector (V)
for each word in the sequence. These vectors are obtained by multiplying
the input vector by three different weight matrices that are learned during
training.
• To assess the significance of each word in the sequence relative to the
others, we calculate the dot product of the Query vector of a word with the
Key vectors of all words in the sequence. This generates a set of scores.
These scores indicate the amount of attention we should give to other parts
of the input sentence while encoding a word at a specific position. (Figure 8)
Self-Attention
• After calculating the score, it is divided by the square root of the dimension
of the key vector (√d_k), which helps achieve more stable gradients. In the
original paper, with d_k set to 64, the score is divided by 8. These scores
are then passed through a softmax operation, which normalizes them so
they are all positive and sum to 1. The softmax function transforms the
scores into a probability distribution, highlighting the most relevant tokens
and reducing the influence of less relevant ones.
• The next step involves multiplying each value vector (V) by its softmax score before summing. This ensures that the values of the important words are preserved while the influence of irrelevant words is minimized, since they are multiplied by very small numbers, like 0.001.
• The final step is to sum the weighted value vectors, resulting in the output of the self-attention layer (Z) at that position. The resulting vector is ready to be sent to the feed-forward neural network.
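In symbols, writing q_i, k_j, v_j for the Query, Key, and Value vectors and d_k for the key dimension, the steps above can be summarized as (standard notation, consistent with the original paper):

\mathrm{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad
\alpha_{ij} = \mathrm{softmax}_j(\mathrm{score}_{ij}), \qquad
z_i = \sum_j \alpha_{ij}\, v_j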
Matrix Calculation of Self-Attention

• Transformers perform these calculations using matrix operations. This approach is much faster and more efficient, especially for processing large amounts of data.
• By using matrices, the model can handle multiple words and their relationships simultaneously, significantly speeding up the computation.
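A minimal NumPy sketch of this matrix form for a single attention head; the weight matrices W_Q, W_K, W_V are random stand-ins for the learned parameters, and the shapes follow the paper (d_model = 512, d_k = 64).

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product self-attention over every position at once.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each of shape (seq_len, d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)       # each row is a probability distribution
    return weights @ V                       # (seq_len, d_k): the Z matrix

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 512, 64
X = rng.normal(size=(seq_len, d_model))      # embedded (and position-encoded) inputs
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)         # shape (6, 64)

Computing all positions together like this is exactly why the approach is fast: a single matrix multiplication replaces a loop over words.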
Multi-headed attention
• Multi-headed attention extends this self-attention concept by running
multiple attention mechanisms (heads) in parallel, allowing the model to focus
on different parts of the input sequence simultaneously.
• Suppose we have an input sentence: “The cat sat on the mat.” In a single-
headed attention mechanism, the model might focus on the relationship
between “cat” and “mat” when predicting the next word. In a multi-headed
attention mechanism, one head might focus on the relationship between
“cat” and “mat,” while another head focuses on the relationship between
“cat” and “sat,” and so on. This multi-faceted approach helps the model
capture more nuances and context.
Multi-headed attention
• Multi-headed attention introduces multiple sets of Query/Key/Value weight
matrices, rather than just one. In the case of the Transformer model, there
are 8 attention heads, resulting in 8 sets of these matrices for each encoder.
Each set is initialized randomly and, after training, projects the input
embeddings or vectors from lower encoders into different representation
subspaces. This diversity in representation helps the model capture a richer
and more detailed understanding of the input.
Multi-headed attention
• This presents a challenge because the feed-forward layer expects a single matrix (a vector for each word), not eight separate matrices. Therefore, we need a method to combine these eight matrices into one.
• To achieve this, we concatenate the matrices and then multiply them by an additional weight matrix, W0.
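Continuing the NumPy sketch above (reusing self_attention, X, and rng), here is a sketch of how the eight heads are combined; W_O below plays the role of the additional weight matrix W0 and is again a random stand-in for a learned parameter.

num_heads, d_model, d_k = 8, 512, 64          # 8 heads of size 64, as in the base model

# One (W_Q, W_K, W_V) set per head, plus the output projection W_O (a.k.a. W0).
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
W_O = rng.normal(size=(num_heads * d_k, d_model))

# Run the 8 heads in parallel, concatenate their outputs, and project back to d_model.
Z_per_head = [self_attention(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
Z = np.concatenate(Z_per_head, axis=-1) @ W_O  # shape (seq_len, 512), one vector per word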
Benefits of Multi-Headed
Attention
• Enhanced Representation: By having multiple heads, the model can
capture different aspects of the input sequence, improving its ability to
understand complex dependencies.
• Parallel Processing: Multi-headed attention allows the model to process
different parts of the input sequence simultaneously, making it more
efficient.
• Rich Feature Extraction: Each head can learn to attend to different
features, providing a richer representation of the input sequence.
Feed-forward network

After the self-attention operation, the output is passed through a feed-forward network (FFN). This FFN is a simple two-layer fully connected network that is applied independently to each position in the sequence. It introduces non-linearity and helps the model capture complex patterns in the data.
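A minimal PyTorch sketch of this position-wise feed-forward network, assuming the sizes from the original paper (inner dimension 2048) and a ReLU non-linearity:

import torch.nn as nn

# Two fully connected layers with a non-linearity in between, applied to each
# position in the sequence independently.
ffn = nn.Sequential(
    nn.Linear(512, 2048),  # expand: d_model -> d_ff
    nn.ReLU(),             # non-linearity
    nn.Linear(2048, 512),  # project back: d_ff -> d_model
)
# y = ffn(x)   # x of shape (batch, seq_len, 512) -> y of the same shape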
Residual Connections and Layer Normalization

• Both the self-attention mechanism and the feed-forward network are followed by residual connections and layer normalization. The residual connections help preserve the information from the original input by adding it to the output of the sub-layer, ensuring the model retains important features. Layer normalization is then applied to stabilize and accelerate training by normalizing the output of the previous step.
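A sketch of this wrapping in PyTorch, following the post-norm arrangement of the original paper, i.e. LayerNorm(x + Sublayer(x)); self_attn and ffn below are placeholder callables standing in for the two sub-layers.

import torch.nn as nn

norm1 = nn.LayerNorm(512)
norm2 = nn.LayerNorm(512)

def encoder_layer(x, self_attn, ffn):
    # Residual connection around self-attention, followed by layer normalization.
    x = norm1(x + self_attn(x))
    # Residual connection around the feed-forward network, followed by normalization.
    x = norm2(x + ffn(x))
    return x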
Decoder

• The primary function of the decoder is to take the encoded input and
generate the output tokens one at a time, in an iterative process.
• The decoding process begins with the first step, where the decoder receives the encoder’s output along with a special start token (like <start>). Since no output tokens have been generated yet, this first step relies solely on the input from the encoder.
Decoder
• As the model moves through later steps, it generates the next token by
using both the encoder’s output and the sequence of previously generated
tokens. The decoder uses a masked self-attention mechanism, ensuring
that each token can only attend to tokens that came before it, preserving
the sequential nature of the output.
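A small NumPy sketch of the causal mask behind masked self-attention: positions above the diagonal (future tokens) are set to minus infinity before the softmax, so their attention weights become zero and each token can only attend to itself and to earlier tokens.

import numpy as np

seq_len = 5
# Boolean mask that is True strictly above the diagonal, i.e. at "future" positions.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Attention scores as in the self-attention sketch above (zeros here just for illustration).
scores = np.zeros((seq_len, seq_len))
scores[future] = -np.inf   # -inf scores turn into 0 weights after the softmax,
                           # hiding future tokens from the decoder.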
Decoder

• Instead of attending to future tokens, the decoder focuses on previously generated tokens, computing a weighted sum of these tokens to emphasize the most relevant ones. This allows the model to maintain context, producing a coherent and meaningful sequence at each step.
• While the self-attention layer helps the decoder focus on the previously
generated tokens in the output sequence, the decoder also incorporates
another important mechanism: the encoder-decoder attention sub-layer.
This sub-layer enables the decoder to attend to the relevant parts of the
input sequence encoded by the encoder.
Decoder
• The key idea here is cross-attention, where the decoder generates queries
(Q) from its previous layer’s output, while the keys (K) and values (V) are
derived from the encoder’s output. This allows the decoder to retrieve the
most relevant information from the encoded input, helping it generate a
contextually accurate output sequence.

• Like self-attention, encoder-decoder attention uses multi-headed attention, which enables the model to focus on different aspects of the input sequence in parallel. This mechanism ensures that the decoder can leverage the information encoded from the input effectively, generating coherent and contextually accurate outputs.
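Continuing the earlier NumPy sketches (reusing rng and the softmax helper, with illustrative shapes), here is a sketch of cross-attention: the only change from self-attention is where Q, K, and V come from. The weight matrices are random stand-ins for the decoder's learned cross-attention parameters.

import numpy as np

dec_x = rng.normal(size=(7, 512))     # decoder states: 7 target positions so far
enc_out = rng.normal(size=(6, 512))   # final encoder output: 6 source positions

Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Q = dec_x @ Wq                        # queries come from the decoder        -> (7, 64)
K = enc_out @ Wk                      # keys come from the encoder output    -> (6, 64)
V = enc_out @ Wv                      # values come from the encoder output  -> (6, 64)

scores = Q @ K.T / np.sqrt(64)        # (7, 6): each target position scores every source position
Z = softmax(scores, axis=-1) @ V      # (7, 64): encoder information pulled into the decoder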
Final Linear Layer and the
Softmax Layer
• Once the decoder stack produces an output vector, this needs to be
converted into a word or token. The final Linear layer projects this output
into a much larger vector, known as the logits vector, with each element
corresponding to a possible word from the model’s vocabulary. For
instance, if the model knows 30,000 words, the logits vector will have
30,000 values, each representing a score for a specific word. The Softmax
layer then transforms these scores into a probability distribution, where
the word with the highest probability is chosen as the next token in the
sequence.
Final Linear Layer and the
Softmax Layer

Through this process, the model generates text in a step-by-step manner, ensuring each word is
contextually relevant based on the input and previously generated tokens.
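A minimal PyTorch sketch of this last step, assuming a hypothetical vocabulary of 30,000 words and greedy decoding (taking the single most probable token):

import torch
import torch.nn as nn

vocab_size, d_model = 30_000, 512
to_logits = nn.Linear(d_model, vocab_size)  # the final Linear layer

decoder_output = torch.rand(1, d_model)     # output vector from the decoder stack for one position
logits = to_logits(decoder_output)          # (1, 30000): one score (logit) per vocabulary word
probs = torch.softmax(logits, dim=-1)       # Softmax layer: scores -> probability distribution
next_token = probs.argmax(dim=-1)           # greedy choice: id of the most probable next word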
