The Decoder: Deconstructed

The document provides an in-depth exploration of the Transformer Decoder block, detailing its operations and the architecture of the Transformer model, which includes multiple Encoder and Decoder blocks. It discusses the sequential nature of the Decoder, its role in text generation tasks, and the differences between Encoder-oriented and Decoder-oriented NLP tasks. The document emphasizes the complexity of Transformer models and the importance of understanding both Encoder and Decoder functionalities for applications like machine translation and text summarization.


The Decoder Deconstructed
Understanding the operations & intuition behind the Transformer Decoder block

Agenda
• The Transformer Model Caricature
• The Multiple Blocks inside the Transformer
• The Transformer Encoder Block - A Recap

Learning Journey

• Decoders & Transformers - A Conceptual Understanding: A deep-dive into Decoders and how they complete the Transformer model.
• Building a Transformer Model - Decoder Implementation: Utilizing PyTorch to understand the functions & classes needed to build an Encoder-Decoder Transformer from scratch.
• Encoder-Decoder Translation - Machine Translation Application: Applying Encoder-Decoder Transformer models to Machine Translation, and fine-tuning on a local corpus.
• Modern Transformer LLMs - Transformer Architectures: A broad overview of modern NLP Transformer Architectures.
The Transformer Model Caricature

Now that we’ve understood the Encoder block, let’s zoom out and look at a high-level caricature of the modern Transformer model.

[Figure: The Modern Transformer Model - Input “My name is Jack.” → Encoder Stage → Decoder Stage → Output “Ich heiße Jack.”]

Note: This Transformer architecture would typically be used for a Machine Translation task, such as English to German translation.
The Transformer Model - High-level Flow

An Encoder-Decoder style architecture is typically used in this type of NLP task, where an input sequence and an output sequence are both required, and the output may be very different from the input. This is the case for tasks like Translation and Question Answering.

The way this works is: an input sequence is first passed to the Encoder stage of the Transformer.

The Encoder stage’s operations eventually compute a high-quality representation of the input sequence, which has captured its syntactic & semantic meaning.

The Decoder stage is responsible for eventually “decoding” this representation into a different sentence, in other words, converting it to the output needed for a task like Machine Translation or Question Answering.

But now, let’s go deeper into what the “stages” here refer to.
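As a minimal, hedged sketch of this flow in code: PyTorch’s built-in nn.Transformer is used here purely as a stand-in for the Encoder and Decoder stages (it expects already-embedded inputs, and the sizes below are arbitrary, not taken from these slides).

import torch
import torch.nn as nn

# Default nn.Transformer shapes are (seq_len, batch, d_model).
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)   # input sequence embeddings  (e.g. "My name is Jack.")
tgt = torch.rand(7, 2, 512)    # output sequence embeddings (e.g. "Ich heiße Jack.")

memory = model.encoder(src)        # Encoder stage: representation of the input sequence
out = model.decoder(tgt, memory)   # Decoder stage: "decodes" that representation
print(out.shape)                   # torch.Size([7, 2, 512])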
The Transformer Model - Multiple Blocks

In reality, the Encoder and Decoder stages each comprise several individual blocks of Encoders and Decoders.

[Figure: The Modern Transformer Model - Input “My name is Jack.” → a stack of Encoder blocks → a stack of Decoder blocks → Output “Ich heiße Jack.”]

Note: As we’ve seen already, each Encoder block itself is a Neural Network architecture with multiple transformations.
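A hedged sketch of this stacking idea in PyTorch. The encoder_block and decoder_block arguments here are hypothetical factories for the blocks described in these slides, not any specific library API.

import torch.nn as nn

class EncoderDecoderStack(nn.Module):
    """Sketch: the Encoder stage and Decoder stage as stacks of N identical blocks."""
    def __init__(self, encoder_block, decoder_block, n_blocks=6):
        super().__init__()
        self.encoders = nn.ModuleList([encoder_block() for _ in range(n_blocks)])
        self.decoders = nn.ModuleList([decoder_block() for _ in range(n_blocks)])

    def forward(self, src, tgt):
        for enc in self.encoders:      # the input flows through every Encoder block
            src = enc(src)
        memory = src                   # final Encoder output feeds the Decoder stage
        for dec in self.decoders:      # each Decoder block also sees the Encoder output
            tgt = dec(tgt, memory)
        return tgt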
The Transformer Encoder Block - A Recap

[Figure: One Encoder block processing the input “I LOVE FOOTBALL”: Positional Encoding → inputs X1, X2, X3 → Self-Attention Layer → Z1, Z2, Z3 → Add & Normalize Layer → Feed Forward layers → Add & Normalize Layer. And all of this is just one Encoder block!]
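As a recap in code, a hedged sketch of one Encoder block following the figure above (the dimensions and the ReLU feed-forward choice are assumptions, not taken from the slides; positional encoding is applied before the block, as in the figure).

import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch: Self-Attention -> Add & Normalize -> Feed Forward -> Add & Normalize."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model), e.g. "I LOVE FOOTBALL"
        z, _ = self.self_attn(x, x, x)     # Self-Attention Layer: Q, K, V all come from x
        x = self.norm1(x + z)              # Add & Normalize Layer (residual connection)
        x = self.norm2(x + self.ff(x))     # Feed Forward + Add & Normalize Layer
        return x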
The Transformer Model

This should give a sense of the level of complexity we’re dealing with in Transformer models.

The original Transformer architecture from 2017 proposed 6 Encoder blocks & 6 Decoder blocks!

So, training multiple such blocks in a Transformer architecture, like the caricature shown, makes Transformer models far more complex than previous Neural Networks. This is why, in many cases, it is not even feasible to train large Transformer models (which could have 12 or 24 such blocks) on big datasets with just individual computers.

In the upcoming slides, we shall attempt to understand the Decoder block of the Transformer, which has many similarities with the Encoder, so that we can then abstract out and speak about the modern large Transformer architectures used in NLP.
Summary

So in order to summarize:
1. We discussed the Modern Transformer architecture for NLP, where the input sequence is passed to the Encoder stage, the representation of the input computed by the Encoder stage is then passed to the Decoder stage, and the Decoder is what eventually gives the output needed for the task.
2. We discussed the Transformer Model with multiple blocks, where each stage of modern Transformer models consists of several individual blocks of Encoders and Decoders.
3. We discussed the Transformer Encoder Block, where each Encoder block (& Decoder block) is itself a Neural Network architecture that consists of multiple linear & non-linear transformations, and stacking multiple such blocks in an architecture is what gives the Transformer its non-linear complexity.

NLP Problem Categories & Encoder vs. Decoder

Agenda
• NLP Problem Categories for the Decoder
• The Encoder vs. the Decoder

NLP Problem Categories

Before we dive in, let’s understand the different NLP problem categories and how they have different requirements for the Transformer architectures needed to build solutions for them.

[Chart: NLP Problems - Language Translation | Text Generation & Completion | Named Entity Recognition (NER) | Text Classification | Text Summarization | Question Answering]

Named Entity Recognition (NER) and Text Classification are different from the other four categories of NLP problems in this chart. They are Encoder-oriented tasks.
NLP Problem Categories - Encoder-oriented Tasks

This is because NER and Text Classification rely only on the high-quality embeddings generated for each word by the Encoder stage of a Transformer. These embeddings are sufficient for them to perform the classification they require.

So, NER & Text Classification can be solved by training an Encoder-Decoder Transformer to predict the missing word in a sentence (say). This training process eventually enables the Encoder to create the high-quality embeddings needed for classification-oriented tasks.

At testing time, NER or Text Classification would purely use the Encoder part of the architecture to make the predictions needed. The Decoder is not required while testing.
NLP Problem Categories - Decoder-oriented Tasks

The Decoder, however, is relevant when the NLP task involves generating text. Generating text is the Decoder’s value addition to the Transformer model. Any modern Transformer model that generates text, such as GPT-3 or ChatGPT (a “Generative” Transformer), is likely using a Decoder to do so.

The Decoder, then, is relevant while making predictions only for the other four NLP problem categories in the diagram:

Text Generation | Question Answering | Text Summarization | Machine Translation

With that context, let’s now dive deeper into the working of the Decoder block and deconstruct its operations, to understand how it generates text and how that differs from what the Encoder is doing.
The Encoder vs. The Decoder

At a high level, the Decoder only slightly differs from the constitution of the Encoder.

[Figure: The Encoder block = Self-Attention Layer → Feed Forward Layer. The Decoder block = Self-Attention Layer → Encoder-Decoder Attention Layer → Feed Forward Layer.]

This additional Encoder-Decoder Attention Layer of the Decoder allows the Decoder to specifically focus on certain parts of the input it receives from the Encoder, in order to generate its output.
Summary

So in order to summarize:
1. We reviewed various NLP problem categories, and which of them are relevant for a Decoder-oriented architecture.
2. We looked at the high-level difference between the Decoder’s set of operations and those of the Encoder: the Decoder includes an Encoder-Decoder Attention Layer as part of its stack.
The Decoder’s Sequential Nature

Agenda
• A Peek Into the Decoder
• The Decoder’s Sequential Nature (Masked Self-Attention)

A Peek into the Decoder

Let’s assume we’re creating this Encoder-Decoder architecture for an English-to-German Machine Translation task.

I love football. (English) → Ich liebe Fußball. (German)

Also, let’s remember that the Decoder operations start at the point where the pass through the Encoder Stage has been completed.

[Figure: Input “I love football.” → The Encoder Stage → The Decoder Stage]
[Figure: One Decoder block. The < SOS > token, with Positional Encoding applied, passes through the Self-Attention Layer, an Add & Normalize Layer, the Encoder-Decoder Attention Layer (which uses K_enc-dec and V_enc-dec from the Encoder, but a new Q as usual), another Add & Normalize Layer, a Feed Forward layer, and a final Add & Normalize Layer. Multiple such Decoder blocks are stacked, followed by a Linear & Softmax layer that outputs the word “Ich”.]

We would initially pass a Start-of-Sentence (< SOS >) token to the Decoder, and the operations of the Decoder would output the word “Ich”.
A Peek into the Decoder

We see immediately that most of these operations are identical to the Encoder:

Self-Attention Layer | Add & Normalize Layer | Feed Forward

But there are a few other operations unique to the Decoder:

Encoder-Decoder Attention Layer | Linear & Softmax

Let’s understand these differences in some more detail.
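Putting those pieces together, a hedged sketch of one Decoder block, using the same conventions as the EncoderBlock sketch earlier (all names and sizes are illustrative, not the course’s actual code).

import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch: Masked Self-Attention -> Add & Norm -> Encoder-Decoder Attention -> Add & Norm -> Feed Forward -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked Self-Attention over the words generated so far
        z, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + z)
        # Encoder-Decoder Attention: K and V come from the Encoder output ("memory"), Q from the Decoder
        z, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + z)
        # Feed Forward + Add & Normalize
        return self.norm3(tgt + self.ff(tgt))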
The Decoder’s Sequential Nature (Masked Self-Attention)

The first difference to note is that, unlike the Encoder, where all the words pass through the Encoder block in parallel, the Decoder is Sequential in nature, similar to how we know RNNs and LSTMs operate.

Starting with the < SOS > token, the Decoder takes the previously generated words & generates one word at a time, until it understands it has generated the last word of the sentence, at which point it generates the End-of-Sentence < EOS > token.

This sequential word-by-word process of the Decoder’s text generation makes the Decoder training stage much more time-consuming than that of the Encoder, and more difficult to parallelize as well.
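A hedged sketch of this sequential generation loop at inference time. Here model, sos_id and eos_id are hypothetical: model stands for a Decoder stack plus Language Model Head that takes target token ids and the Encoder output and returns logits of shape (batch, length, vocab_size).

import torch

def greedy_decode(model, memory, sos_id, eos_id, max_len=50):
    """Sketch: generate one token at a time, feeding everything generated so far back in."""
    generated = [sos_id]                               # start with the < SOS > token
    for _ in range(max_len):
        tgt = torch.tensor(generated).unsqueeze(0)     # (1, current_length)
        logits = model(tgt, memory)                    # all previous tokens go into the Decoder
        next_id = logits[0, -1].argmax().item()        # pick the most probable next word
        generated.append(next_id)
        if next_id == eos_id:                          # stop once < EOS > is produced
            break
    return generated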
[Animation: All Decoder Operations producing “Ich liebe Fußball < EOS >” one time step at a time, with Positional Encoding applied to the inputs: first < SOS >, then < SOS > Ich, then < SOS > Ich liebe, and so on, with the future words masked at each step.]

This characteristic of “masking” the future words / tokens, and only allowing inputs to the Decoder operations from current & past words in each run through the Decoder, is why this process is sometimes called Masked Self-Attention.

Note: As seen from the animation, at each time step it is not just the input from that word, but the inputs of all previous words as well, that go into the Decoder to predict the output of that time step.
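A hedged sketch of the mask itself: a square matrix that blocks attention to future positions. Using an additive -inf mask is one common convention; a tensor like this can be passed as the attn_mask / tgt_mask in the earlier DecoderBlock sketch.

import torch

def causal_mask(size):
    """Sketch: positions may attend to themselves and the past, never the future."""
    # Upper triangle (future positions) is -inf, so softmax assigns them zero weight.
    mask = torch.full((size, size), float("-inf"))
    return torch.triu(mask, diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])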
Summary

So in order to summarize:
1. We took a high-level peek inside the Decoder using a pictorial representation.
2. We understood the sequential word-by-word process of the Decoder’s text generation through Masked Self-Attention.

The Encoder-Decoder Attention Layer

Agenda
• The Encoder-Decoder Attention Layer
• The Linear & Softmax Layers - The Language Model Head

The Encoder-Decoder Attention Layer

The other major difference is, of course, the Encoder-Decoder Attention Layer.

[Figure: The Encoder-Decoder Attention Layer, which takes K_enc-dec and V_enc-dec from the Encoder, but a new Q as usual.]

The difference from normal Self-Attention is that in this layer, the K and V vectors are not generated from the input embeddings to this layer, the way they were in the normal Self-Attention layer.

Instead, we utilize a K encoder-decoder (K_enc-dec) and a V encoder-decoder (V_enc-dec) in this layer, whose source is the final output of the Encoder stage.
The Encoder-Decoder Attention Layer

We directly utilize the final embedding vectors generated at the end of the Encoder stage, and multiply those with weight matrices to get K_enc-dec & V_enc-dec. These get used as K and V in this Encoder-Decoder Attention Layer.

It is only the Q vector that this layer creates from its own input, the way that normally happens in the Self-Attention Layer (where all three of K, Q & V are directly created from the input embeddings to the layer).

It is also important to mention that the Q for < SOS > (Dec Pos 0), for example, relies on the K_enc-dec & V_enc-dec from the input (for instance, the word “I” at Enc Pos 1) to predict the word “Ich”. This happens for every Decoder word.
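A hedged sketch of this asymmetry using nn.MultiheadAttention (shapes and names are illustrative): the query comes from the Decoder side, while the key and value come from the final Encoder output, with the K_enc-dec and V_enc-dec projections applied inside the attention layer.

import torch
import torch.nn as nn

d_model, n_heads = 512, 8
enc_dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_output = torch.rand(1, 4, d_model)   # e.g. Encoder output for "I love football ."
decoder_input  = torch.rand(1, 1, d_model)   # e.g. the < SOS > position on the Decoder side

# Q is created from the Decoder input; K and V are created from the Encoder's final output.
out, attn_weights = enc_dec_attn(query=decoder_input,
                                 key=encoder_output,
                                 value=encoder_output)
print(out.shape, attn_weights.shape)          # torch.Size([1, 1, 512]) torch.Size([1, 1, 4])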
The Encoder-Decoder Attention Layer

This is why the Encoder-Decoder architecture is actually represented as the final Encoder block feeding every block in the Decoder stage.

[Figure: The Modern Transformer Model - Input “My name is Jack.” → a stack of Encoder blocks → a stack of Decoder blocks → Output “Ich heiße Jack.”, with K_enc-dec and V_enc-dec arrows going from the final Encoder block to each Decoder block.]

The arrows from the final Encoder block to each Decoder block represent the K_enc-dec & V_enc-dec from the final Encoder layer being used in the Encoder-Decoder Attention Layer of each Decoder block in the Decoder stage.
The Linear & Softmax Layers - The Language Model Head

At the end of the Decoder stage, there’s a Linear and Softmax layer that performs a fairly simple operation needed to get the final word prediction.

[Figure: Decoder output → Linear & Softmax → “Ich”]

The Linear layer is merely a fully-connected layer of neurons, with a number of nodes equalling the size of the entire vocabulary, in addition to some special tokens.

This is then fed to the final Softmax layer, which converts the numerical outputs into probabilities, so that the word with the highest probability can be selected as the output of the Decoder, in the style of a multi-class Classification problem.

Finally, Categorical Cross-Entropy is the loss function used for backpropagation.

This construct is called the Language Model Head, and this is how the Decoder eventually generates a word at each sequential time step!
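A hedged sketch of the Language Model Head. The vocabulary size and target index are assumptions; note that nn.CrossEntropyLoss applies the softmax internally, so the explicit softmax here is only for reading off the probabilities.

import torch
import torch.nn as nn

d_model, vocab_size = 512, 30000            # vocabulary size incl. special tokens (illustrative)

lm_head = nn.Linear(d_model, vocab_size)    # the Linear layer: one node per vocabulary entry
loss_fn = nn.CrossEntropyLoss()             # Categorical Cross-Entropy for backpropagation

decoder_output = torch.rand(1, d_model)     # Decoder output for one time step
logits = lm_head(decoder_output)            # (1, vocab_size)

probs = torch.softmax(logits, dim=-1)       # Softmax layer: numerical outputs -> probabilities
predicted_word_id = probs.argmax(dim=-1)    # pick the highest-probability word (e.g. "Ich")

target_word_id = torch.tensor([42])         # hypothetical index of the correct next word
loss = loss_fn(logits, target_word_id)      # loss is computed on the raw logits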
Summary

So in order to summarize:
1. We discussed the Encoder-Decoder Attention Layer in detail, using an example.
2. We understood the Linear & Softmax Layers that perform the simple classification operation needed to get the final word prediction.

Building the Transformer in PyTorch

Agenda
• Building the Transformer Model in PyTorch - Hands-on

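The hands-on notebook itself is not reproduced here, but as a rough, hedged sketch of how the pieces from the earlier sections might come together: TinyTransformer below reuses the hypothetical EncoderBlock and DecoderBlock sketches shown earlier, and all sizes are illustrative. It uses a learned positional embedding as an assumption; the original paper used sinusoidal encodings.

import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Sketch: embeddings + positional encoding + Encoder/Decoder stacks + Language Model Head."""
    def __init__(self, vocab_size=30000, d_model=512, n_blocks=6, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)      # learned positional encoding (assumption)
        self.encoders = nn.ModuleList([EncoderBlock(d_model) for _ in range(n_blocks)])
        self.decoders = nn.ModuleList([DecoderBlock(d_model) for _ in range(n_blocks)])
        self.lm_head = nn.Linear(d_model, vocab_size)  # Linear layer of the Language Model Head

    def add_pos(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.embed(tokens) + self.pos(positions)

    def forward(self, src_tokens, tgt_tokens, tgt_mask=None):
        memory = self.add_pos(src_tokens)
        for enc in self.encoders:
            memory = enc(memory)                       # Encoder stage
        x = self.add_pos(tgt_tokens)
        for dec in self.decoders:
            x = dec(x, memory, tgt_mask=tgt_mask)      # Decoder stage (with masked self-attention)
        return self.lm_head(x)                         # logits over the vocabulary at each position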
Summary

So in order to summarize:
1. We understood the code-based building blocks of how the Transformer model finally comes together to generate words at each time step of the Decoder, using PyTorch.

Happy Learning!
