The Decoder: Deconstructed
A broad overview of modern NLP Transformer Architectures
A deep-dive into Decoders and how they complete the Transformer model
The Transformer Model Caricature
Now that we've understood the Encoder block, let's zoom out and look at a high-level caricature of the modern Transformer model.
[Figure: high-level caricature of the Transformer producing the output "Ich heiße Jack." (German for "My name is Jack.")]
The way this works is that an input sequence is first passed to the Encoder stage of the Transformer.
In reality, the Encoder and Decoder stages each comprise several individual blocks of Encoders and Decoders.
[Figure: a stack of Encoder blocks feeding a stack of Decoder blocks; each block contains a Self-Attention Layer, with word embeddings X1, X2, X3 and Positional Encoding at the input.]
This should give a sense of the level of complexity we're dealing with in Transformer models.
The original Transformer architecture from 2017 proposed 6 Encoders & 6 Decoders!
Training multiple such blocks in a Transformer architecture, like the caricature shown, makes Transformer models far more complex than previous Neural Networks.
This is why, in many cases, it is not even feasible to train large Transformer models (which could have 12 or 24 such blocks) on big datasets with individual computers alone.
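To make that scale concrete, here is a minimal sketch (illustrative only, not this course's own code) that instantiates the original 6-Encoder / 6-Decoder stack with PyTorch's built-in nn.Transformer; counting its parameters shows why training on a single computer quickly becomes impractical.

import torch
import torch.nn as nn

# The defaults of nn.Transformer mirror the 2017 architecture:
# d_model = 512, 8 attention heads, 6 Encoder and 6 Decoder blocks.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,   # the 6 stacked Encoder blocks
    num_decoder_layers=6,   # the 6 stacked Decoder blocks
)

src = torch.rand(10, 2, 512)   # (sequence_length, batch_size, d_model) input to the Encoder stack
tgt = torch.rand(7, 2, 512)    # words generated so far, fed to the Decoder stack
out = model(src, tgt)          # shape: (7, 2, 512)

# Tens of millions of trainable weights, before any embedding or output layers.
print(sum(p.numel() for p in model.parameters()))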
[Figure: chart of NLP Problem categories, including NER, Text Classification, Text Summarization, and Question Answering.]
Named Entity Recognition (NER) and Text Classification are different from the other four categories of NLP problems in this chart. They are Encoder-oriented tasks.
NLP Problem Categories - Encoder-oriented Tasks
This is because NER and Text Classification rely only on the high-quality embeddings generated for each word by the Encoder stage of a Transformer. These embeddings are sufficient for them to perform the classification they require.
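As a rough illustration of why the Encoder alone suffices for such tasks, the hypothetical sketch below (not from this course) pools the per-word embeddings from an Encoder stack and passes them to a small classifier; no Decoder is involved.

import torch
import torch.nn as nn

d_model, num_classes = 512, 3

# A 6-layer Encoder stack and a simple classification head.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
classifier = nn.Linear(d_model, num_classes)

tokens = torch.rand(12, 1, d_model)   # (seq_len, batch, d_model) word embeddings
contextual = encoder(tokens)          # high-quality embeddings for each word
pooled = contextual.mean(dim=0)       # one vector per sentence
logits = classifier(pooled)           # class scores for Text Classification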
The Decoder, however, is relevant when the NLP task involves generating text.
Generating text is the Decoder’s value addition to the Transformer model.
Any modern Transformer model that generates text, such as GPT-3 or ChatGPT (a
“Generative” Transformer), is likely using a Decoder to do so.
The Decoder, then, is relevant for making predictions only in the other four NLP problem categories in the diagram.
With that context, let’s now dive deeper into the working of the
Decoder block and deconstruct its operations, to understand
how it generates text and how that differs from what the Encoder
is doing.
[Figure: the Decoder block's stack, which adds an Encoder-Decoder Attention Layer on top of the Self-Attention Layer.]
To summarize:
1. We reviewed the various NLP problem categories and identified which of them are relevant for a Decoder-oriented architecture.
2. We looked at the high-level difference between the Decoder's set of operations and the Encoder's: the Decoder includes an Encoder-Decoder Attention Layer as part of its stack.
[Figure: the Decoder block in detail. The input sentence "I love football." passes through the Encoder; the Decoder, generating "Ich", stacks a Self-Attention Layer, an Add & Normalize Layer, an Encoder-Decoder Attention Layer (which receives K_enc-dec and V_enc-dec from the Encoder but computes a new Q as usual), and another Add & Normalize Layer, with Positional Encoding at its input.]
The first difference to note is that unlike the Encoder, where all the words pass through the Encoder block in parallel, the Decoder is Sequential in nature, similar to how RNNs and LSTMs operate.
Starting with the Start of Sentence <SOS> token, the Decoder takes the previous word and generates one word at a time, until it determines it has generated the last word of the sentence, at which point it generates the End of Sentence <EOS> token.
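The sketch below captures that sequential loop. The decoder_step() function is hypothetical, a stand-in for one full pass through the Decoder stack that returns the most probable next word.

SOS, EOS = "<SOS>", "<EOS>"

def generate(decoder_step, encoder_output, max_len=50):
    generated = [SOS]
    for _ in range(max_len):
        # Each step sees the words generated so far plus the Encoder's output.
        next_word = decoder_step(generated, encoder_output)
        if next_word == EOS:
            break                     # the Decoder signals the end of the sentence
        generated.append(next_word)
    return generated[1:]              # drop the <SOS> token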
The other major difference is, of course, the Encoder-Decoder Attention Layer.
[Figure: the Encoder-Decoder Attention Layer, which takes K_enc-dec and V_enc-dec from the Encoder output but computes a new Q as usual.]
The difference from normal Self-Attention is that in this layer, the K and V vectors are not generated from this layer's own input embeddings, the way they were in the normal Self-Attention layer. Instead, they are derived from the Encoder's output, while a new Q is still computed from the Decoder's inputs as usual.
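A minimal NumPy sketch of this computation is shown below; the weight matrices are random placeholders, and the point is only where Q, K, and V come from.

import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

encoder_output = rng.normal(size=(10, d_model))   # 10 input words, from the Encoder stage
decoder_input = rng.normal(size=(3, d_model))     # 3 words generated so far by the Decoder

# Random projection matrices (scaled for stability), standing in for learned weights.
W_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q = decoder_input @ W_q     # a new Q, computed as usual from the Decoder's inputs
K = encoder_output @ W_k    # K_enc-dec, taken from the Encoder's output
V = encoder_output @ W_v    # V_enc-dec, taken from the Encoder's output

scores = Q @ K.T / np.sqrt(d_k)                 # (3, 10) scaled dot-product scores
scores -= scores.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ V      # each generated word attends over the whole input sentence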
[Figure: the stacked Encoders and Decoders producing the numerical output for the first generated word, "Ich".]
This is then fed to the final Softmax layer, which converts the numerical outputs into
probabilities, so that the word with the highest probability can be selected as the
output of the Decoder, in the style of a multi-class Classification problem.
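A toy example of that final step, with a made-up four-word vocabulary, looks like this:

import numpy as np

vocab = ["Ich", "heiße", "Jack", "<EOS>"]     # toy vocabulary
logits = np.array([2.1, 0.3, -1.0, 0.5])      # the Decoder's numerical output for one step

probs = np.exp(logits) / np.exp(logits).sum() # Softmax turns the scores into probabilities
print(vocab[int(np.argmax(probs))])           # "Ich" -- the highest-probability word is selected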