The Decoder: Deconstructed
A broad overview of modern NLP Transformer Architectures
A deep-dive into Decoders and how they complete the Transformer model
The Transformer Model Caricature
Now that we've understood the Encoder block, let's zoom out and look at a high-level caricature of the modern Transformer model.
[Figure: high-level caricature of the Transformer producing the output "Ich heiße Jack." (German for "My name is Jack.")]
The way this works is that an input sequence is first passed to the Encoder stage of the Transformer.
In reality, the Encoder and Decoder stages each comprise several individual blocks of Encoders and Decoders.
[Figure: a stack of Encoder blocks feeding a stack of Decoder blocks; each block contains a Self-Attention Layer, with word embeddings X1, X2, X3 and Positional Encoding at the input.]
This should give a sense of the level of complexity we're dealing with in Transformer models.
The original Transformer architecture from 2017 proposed 6 Encoders & 6 Decoders!
Training multiple such blocks in a Transformer architecture, like the caricature shown, makes Transformer models far more complex than previous Neural Networks.
This is why, in many cases, it is not even feasible to train large Transformer models (which could have 12 or 24 such blocks) on big datasets with individual computers alone.
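To make that scale concrete, here is a minimal sketch (illustrative only, not this course's own code) that instantiates the original 6-Encoder / 6-Decoder stack with PyTorch's built-in nn.Transformer; counting its parameters shows why training on a single computer quickly becomes impractical.

import torch
import torch.nn as nn

# The defaults of nn.Transformer mirror the 2017 architecture:
# d_model = 512, 8 attention heads, 6 Encoder and 6 Decoder blocks.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,   # the 6 stacked Encoder blocks
    num_decoder_layers=6,   # the 6 stacked Decoder blocks
)

src = torch.rand(10, 2, 512)   # (sequence_length, batch_size, d_model) input to the Encoder stack
tgt = torch.rand(7, 2, 512)    # words generated so far, fed to the Decoder stack
out = model(src, tgt)          # shape: (7, 2, 512)

# Tens of millions of trainable weights, before any embedding or output layers.
print(sum(p.numel() for p in model.parameters()))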
[Figure: chart of NLP Problem categories, including NER, Text Classification, Text Summarization, and Question Answering.]
Named Entity Recognition (NER) and Text Classification are different from the other four categories of NLP problems in this chart. They are Encoder-oriented tasks.
NLP Problem Categories - Encoder-oriented Tasks
This is because NER and Text Classification rely only on the high-quality embeddings generated for each word by the Encoder stage of a Transformer. These embeddings are sufficient for them to perform the classification they require.
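As a rough illustration of why the Encoder alone suffices for such tasks, the hypothetical sketch below (not from this course) pools the per-word embeddings from an Encoder stack and passes them to a small classifier; no Decoder is involved.

import torch
import torch.nn as nn

d_model, num_classes = 512, 3

# A 6-layer Encoder stack and a simple classification head.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
classifier = nn.Linear(d_model, num_classes)

tokens = torch.rand(12, 1, d_model)   # (seq_len, batch, d_model) word embeddings
contextual = encoder(tokens)          # high-quality embeddings for each word
pooled = contextual.mean(dim=0)       # one vector per sentence
logits = classifier(pooled)           # class scores for Text Classification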
The Decoder, however, is relevant when the NLP task involves generating text.
Generating text is the Decoder’s value addition to the Transformer model.
Any modern Transformer model that generates text, such as GPT-3 or ChatGPT (a
“Generative” Transformer), is likely using a Decoder to do so.
The Decoder, then, is relevant for making predictions only in the other four NLP problem categories in the diagram.
With that context, let’s now dive deeper into the working of the
Decoder block and deconstruct its operations, to understand
how it generates text and how that differs from what the Encoder
is doing.
[Figure: the Decoder block's stack, which adds an Encoder-Decoder Attention Layer on top of the Self-Attention Layer.]
To summarize:
1. We reviewed the various NLP problem categories and identified which of them are relevant for a Decoder-oriented architecture.
2. We looked at the high-level difference between the Decoder's set of operations and the Encoder's: the Decoder includes an Encoder-Decoder Attention Layer as part of its stack.
[Figure: the Decoder block in detail. The input sentence "I love football." passes through the Encoder; the Decoder, generating "Ich", stacks a Self-Attention Layer, an Add & Normalize Layer, an Encoder-Decoder Attention Layer (which receives K_enc-dec and V_enc-dec from the Encoder but computes a new Q as usual), and another Add & Normalize Layer, with Positional Encoding at its input.]
The first difference to note is that unlike the Encoder, where all the words pass through the Encoder block in parallel, the Decoder is Sequential in nature, similar to how RNNs and LSTMs operate.
Starting with the Start of Sentence <SOS> token, the Decoder takes the previous word and generates one word at a time, until it determines it has generated the last word of the sentence, at which point it generates the End of Sentence <EOS> token.
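The sketch below captures that sequential loop. The decoder_step() function is hypothetical, a stand-in for one full pass through the Decoder stack that returns the most probable next word.

SOS, EOS = "<SOS>", "<EOS>"

def generate(decoder_step, encoder_output, max_len=50):
    generated = [SOS]
    for _ in range(max_len):
        # Each step sees the words generated so far plus the Encoder's output.
        next_word = decoder_step(generated, encoder_output)
        if next_word == EOS:
            break                     # the Decoder signals the end of the sentence
        generated.append(next_word)
    return generated[1:]              # drop the <SOS> token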
The other major difference is, of course, the Encoder-Decoder Attention Layer.
[Figure: the Encoder-Decoder Attention Layer, which takes K_enc-dec and V_enc-dec from the Encoder output but computes a new Q as usual.]
The difference from normal Self-Attention is that in this layer, the K and V vectors are not generated from this layer's own input embeddings, the way they were in the normal Self-Attention layer. Instead, they are derived from the Encoder's output, while a new Q is still computed from the Decoder's inputs as usual.
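A minimal NumPy sketch of this computation is shown below; the weight matrices are random placeholders, and the point is only where Q, K, and V come from.

import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

encoder_output = rng.normal(size=(10, d_model))   # 10 input words, from the Encoder stage
decoder_input = rng.normal(size=(3, d_model))     # 3 words generated so far by the Decoder

# Random projection matrices (scaled for stability), standing in for learned weights.
W_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q = decoder_input @ W_q     # a new Q, computed as usual from the Decoder's inputs
K = encoder_output @ W_k    # K_enc-dec, taken from the Encoder's output
V = encoder_output @ W_v    # V_enc-dec, taken from the Encoder's output

scores = Q @ K.T / np.sqrt(d_k)                 # (3, 10) scaled dot-product scores
scores -= scores.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ V      # each generated word attends over the whole input sentence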
[Figure: the stacked Encoders and Decoders producing the numerical output for the first generated word, "Ich".]
This is then fed to the final Softmax layer, which converts the numerical outputs into
probabilities, so that the word with the highest probability can be selected as the
output of the Decoder, in the style of a multi-class Classification problem.
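A toy example of that final step, with a made-up four-word vocabulary, looks like this:

import numpy as np

vocab = ["Ich", "heiße", "Jack", "<EOS>"]     # toy vocabulary
logits = np.array([2.1, 0.3, -1.0, 0.5])      # the Decoder's numerical output for one step

probs = np.exp(logits) / np.exp(logits).sum() # Softmax turns the scores into probabilities
print(vocab[int(np.argmax(probs))])           # "Ich" -- the highest-probability word is selected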