Encoder and Decoder Diagram Explanation

The document outlines the process of a decoder in a neural network, detailing the steps involved in generating output sequences using masked multi-head attention, encoder-decoder attention, and feed-forward networks. It explains how masking prevents the model from accessing future tokens, ensuring realistic predictions, and describes the iterative process of token selection based on probability distributions. The final output is generated through repeated iterations until a stopping criterion is met.


"The cat sat on the mat.

"

Without Masking (Incorrect):

If the model has access to the entire sequence without any masking, it can "peek" at future words
while making predictions. This means it might see the word "mat" while predicting "sat," which
amounts to cheating and does not reflect inference, where future tokens have not been generated yet.
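
To make the mask concrete, here is a minimal sketch, assuming PyTorch and a toy word-level
tokenization (both are illustrative assumptions, not part of the original explanation). It builds the
causal mask for this sentence and shows that the row for "sat" blocks every later position,
including "mat".

```python
import torch

# Toy word-level "tokenization" of the example sentence.
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
seq_len = len(tokens)

# Causal mask: entry [i, j] is True when position j lies in the future of
# position i and must therefore be hidden from the attention computation.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Row for "sat" (index 2): "on", "the", "mat", "." are all blocked.
print(causal_mask[2])
# tensor([False, False, False,  True,  True,  True,  True])
```
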
1. Masked Multi-Head Attention:

o Self-Attention: The decoder first performs self-attention on its own output. This
allows the decoder to focus on different parts of the output sequence it has
generated so far.

o Masking: To prevent the decoder from "peeking" at future tokens in the output
sequence, a mask is applied. This ensures that the decoder only attends to previous
tokens.

o Multi-Head Attention: Multiple attention heads are used to capture different aspects
of the output sequence.

2. Encoder-Decoder Attention:

o Cross-Attention: The decoder then performs attention over the encoder's output.
This allows the decoder to align its output with the relevant parts of the input
sequence.

o Multi-Head Attention: Multiple attention heads are used to capture different
relationships between the input and output sequences.

3. Feed-Forward Network (FFN):

o Position-wise Feed-Forward Networks: Each position in the output sequence is fed
through a fully connected feed-forward network. This introduces non-linearity and
allows the model to learn complex relationships between the input and output.
4. Linear Layer:

o Projection: A linear layer is applied to project the output of the FFN into a vector
space that matches the size of the vocabulary.

5. Softmax:

o Probability Distribution: Softmax is applied to the output of the linear layer to
obtain a probability distribution over the vocabulary.

o Next Token Prediction: The token with the highest probability is selected as the next
token in the output sequence.
Each layer processes the output of the previous layer; a minimal sketch of a single decoder layer follows below.
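
The sketch below puts steps 1-5 into code, assuming PyTorch; the class name MiniDecoderLayer, the
dimensions, and the vocabulary size are illustrative assumptions. A real Transformer decoder also
adds residual connections, layer normalization, and a stack of such layers, with the linear
projection and softmax applied only after the final layer; everything is folded into one module
here so the five steps stay visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, vocab_size=32000):
        super().__init__()
        # 1. Masked multi-head self-attention over the tokens generated so far.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2. Encoder-decoder (cross) attention over the encoder's output.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 3. Position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # 4. Linear projection from the model dimension to the vocabulary size.
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, dec_input, enc_output):
        seq_len = dec_input.size(1)
        # Causal mask so each position attends only to itself and earlier positions.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=dec_input.device), diagonal=1
        ).bool()
        x, _ = self.self_attn(dec_input, dec_input, dec_input, attn_mask=mask)
        # Queries come from the decoder, keys and values from the encoder output.
        x, _ = self.cross_attn(x, enc_output, enc_output)
        x = self.ffn(x)
        logits = self.to_vocab(x)
        # 5. Softmax turns the scores into a probability distribution over the vocabulary.
        return F.softmax(logits, dim=-1)

# Usage with random tensors standing in for real embeddings:
layer = MiniDecoderLayer()
probs = layer(torch.randn(1, 3, 512), torch.randn(1, 6, 512))
next_token_id = probs[0, -1].argmax()  # highest-probability next token
```
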

Iteration 1:

1. Self-Attention (Current Sequence):

o The decoder uses self-attention to focus on different parts of the generated
sequence (initially just <s>).

2. Encoder-Decoder Attention:

o The decoder uses cross-attention to attend to the encoder's output, which is the
contextualized matrix of "Hi, how are you?"

o It gathers relevant contextual information from the encoder's output, with each
decoder layer processing the output of the previous one.

3. Feed-Forward Network:

o The combined information from the attention mechanisms is processed through a
feed-forward neural network.

4. Softmax Layer:

o The output is passed through a softmax layer to generate a probability distribution
over the vocabulary for the next token.

5. Token Selection:

o The token with the highest probability (e.g., "I'm") is selected as the next token.
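
As a toy illustration of step 5, the snippet below picks the highest-probability token; the
vocabulary and the probability values are made up to show the mechanics, not produced by a real
model.

```python
import torch

vocab = ["<s>", "</s>", "I'm", "good", "how", "are", "you?"]
# Hypothetical softmax output for the position after <s>:
probs = torch.tensor([0.01, 0.02, 0.70, 0.10, 0.07, 0.05, 0.05])
next_token = vocab[int(probs.argmax())]  # -> "I'm"
```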

1. Feed-Forward Network:

 Role: After the self-attention and cross-attention mechanisms, the feed-forward network
(FFN) processes the outputs to transform the encoded information into the required
format.

 Function: It consists of two linear layers with a ReLU activation in between. This helps in
capturing complex patterns and relationships in the data.

2. Linear Layer:

 Role: The linear (or dense) layer acts as a transformation step. It maps the output of the
feed-forward network to the vocabulary size.

 Function: This layer projects the high-dimensional output of the FFN to the dimension of
the vocabulary, creating a vector where each position corresponds to a token in the
vocabulary.

3. Softmax Layer:

 Role: The softmax layer converts the output from the linear layer into a probability
distribution over the vocabulary.
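
A shape-focused sketch of these three components, assuming PyTorch; the dimensions
(d_model=512, d_ff=2048, vocab_size=32000) are illustrative choices, not taken from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, vocab_size = 512, 2048, 32000

# 1. FFN: two linear layers with a ReLU activation in between.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
# 2. Linear layer: projects to the vocabulary dimension.
to_vocab = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 4, d_model)     # decoder output for a 4-token sequence
h = ffn(x)                         # (1, 4, 512): same shape, transformed
logits = to_vocab(h)               # (1, 4, 32000): one score per vocabulary token
probs = F.softmax(logits, dim=-1)  # 3. Softmax: each row now sums to 1
```
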
Iteration 2:

9. Next Input Sequence:

o The input sequence now includes the previously generated token: <s> I'm

10. Self-Attention (Current Sequence):

o The decoder focuses on the current sequence <s> I'm.

11. Encoder-Decoder Attention:

o It attends to the encoder's contextualized matrix of "Hi, how are you?" again to
gather more relevant information.

12. Feed-Forward Network:

o The output is processed through the feed-forward network.

13. Softmax Layer:

o A probability distribution is generated for the next token.

14. Token Selection:

o The token with the highest probability (e.g., "good") is selected.

Iteration 3:

15. Next Input Sequence:

o The input sequence is now: <s> I'm good.

16. Self-Attention (Current Sequence):

o The decoder focuses on the sequence <s> I'm good.

17. Encoder-Decoder Attention:

o It attends to the encoder's output again.

18. Feed-Forward Network:

o The output is processed.

19. Softmax Layer:

o A probability distribution is generated.

20. Token Selection:

o The token with the highest probability (e.g., "how") is selected.

Final Iteration:

21. Repeat Steps 9-20:

o The process repeats, generating tokens like "are", "you?" until a stopping criterion is
met (e.g., the end token </s>).

Final Output:

 The final output sequence might be: "I'm good, how are you?"
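
The whole iterative procedure can be summarized as a greedy decoding loop. The sketch below assumes
PyTorch, a hypothetical decoder(input_ids, enc_output) callable that returns next-token
probabilities, and illustrative ids for <s> and </s>.

```python
import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 50  # illustrative values

def greedy_decode(decoder, enc_output):
    generated = [BOS_ID]                        # start with <s>
    for _ in range(MAX_LEN):
        input_ids = torch.tensor([generated])   # current sequence, e.g. <s> I'm good
        probs = decoder(input_ids, enc_output)  # distribution over the vocabulary
        next_id = int(probs[0, -1].argmax())    # pick the highest-probability token
        if next_id == EOS_ID:                   # stop at the end token </s>
            break
        generated.append(next_id)               # feed it back in the next iteration
    return generated[1:]                        # drop <s>
```

In practice, beam search or sampling is often used instead of taking the argmax at every step, but
the iteration structure stays the same.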
