Encoder and decoder diagram explanation
If the model has access to the entire sequence without any masking, it can "peek" at future words
while making predictions. This means it might see the word "mat" while predicting "sat," which
would be cheating and not representative of real-world usage.
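As a small illustration (not part of the original diagram), the snippet below builds the causal mask for a toy four-token sequence and marks which future positions each token is forbidden to attend to; PyTorch is used here purely for convenience.

```python
# Illustrative causal mask for a toy target sequence; True marks future positions
# that masked self-attention is NOT allowed to look at.
import torch

tokens = ["the", "cat", "sat", "mat"]  # toy example echoing the "sat"/"mat" case above
n = len(tokens)
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
print(mask)
# tensor([[False,  True,  True,  True],   # "the" sees only itself
#         [False, False,  True,  True],   # "cat" sees "the", "cat"
#         [False, False, False,  True],   # "sat" cannot see "mat" -- no cheating
#         [False, False, False, False]])  # "mat" sees the whole prefix
```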
1. Masked Multi-Head Attention:
o Self-Attention: The decoder first performs self-attention on its own output. This
allows the decoder to focus on different parts of the output sequence it has
generated so far.
o Masking: To prevent the decoder from "peeking" at future tokens in the output
sequence, a mask is applied. This ensures that the decoder only attends to previous
tokens.
o Multi-Head Attention: Multiple attention heads are used to capture different aspects
of the output sequence.
2. Encoder-Decoder Attention:
o Cross-Attention: The decoder then performs attention over the encoder's output. This allows the decoder to align its output with the relevant parts of the input sequence.
3. Feed-Forward Network:
o Transformation: A position-wise feed-forward network (two linear layers with a ReLU in between) further transforms each position's representation.
4. Linear Layer:
o Projection: A linear layer is applied to project the output of the FFN into a vector space that matches the size of the vocabulary.
5. Softmax:
o Probability Distribution: The softmax converts the projected vector into a probability distribution over the vocabulary.
o Next Token Prediction: The token with the highest probability is selected as the next token in the output sequence.
Each decoder layer processes the output of the previous layer; a minimal sketch of one such layer follows.
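To make the flow above concrete, here is an illustrative sketch of one decoder layer in PyTorch. It is a sketch under assumptions, not the exact implementation behind the diagram: the sizes (d_model, n_heads, d_ff) are arbitrary, and the residual connections and layer normalization found in real transformer layers are omitted for brevity.

```python
# A minimal sketch of one decoder layer, assuming PyTorch's nn.MultiheadAttention.
# Sizes are illustrative; residual connections and layer norm are omitted for brevity.
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward network: two linear layers with a ReLU in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, decoder_input, encoder_output):
        seq_len = decoder_input.size(1)
        # Causal mask: True marks future positions the decoder may not attend to.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        # 1. Masked multi-head self-attention over the tokens generated so far.
        x, _ = self.self_attn(decoder_input, decoder_input, decoder_input,
                              attn_mask=causal_mask)
        # 2. Encoder-decoder (cross) attention: queries from the decoder,
        #    keys and values from the encoder's contextualized output.
        x, _ = self.cross_attn(x, encoder_output, encoder_output)
        # 3. Position-wise feed-forward network.
        return self.ffn(x)
```

Stacking several such layers, each consuming the previous layer's output, gives the full decoder described above.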
Iteration 1:
1. Masked Multi-Head Attention:
o The decoder starts from the start-of-sequence token <s> and performs masked self-attention over the tokens generated so far (only <s> at this point).
2. Encoder-Decoder Attention:
o The decoder uses cross-attention to attend to the encoder's output, which is the
contextualized matrix of "Hi, how are you?"
o It gathers relevant contextual information from the encoder's output.
3. Feed-Forward Network:
o The FFN further processes the attended representation, as described in the component details below.
4. Softmax Layer:
o After the linear projection to vocabulary size, the softmax produces a probability distribution over all tokens.
5. Token Selection:
o The token with the highest probability (e.g., "I'm") is selected as the next token.
1. Feed-Forward Network:
Role: After the self-attention and cross-attention mechanisms, the feed-forward network (FFN) processes the attention outputs position by position, producing the representation that is passed on to the next layer (or, at the top layer, to the linear projection).
Function: It consists of two linear layers with a ReLU activation in between. This helps capture complex patterns and relationships in the data.
2. Linear Layer:
Role: The linear (or dense) layer acts as a transformation step. It maps the output of the
feed-forward network to the vocabulary size.
Function: This layer projects the high-dimensional output of the FFN to the dimension of
the vocabulary, creating a vector where each position corresponds to a token in the
vocabulary.
3. Softmax Layer:
Role: The softmax layer converts the output from the linear layer into a probability
distribution over the vocabulary.
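A sketch of these three components, using assumed sizes (d_model = 512, vocab_size = 32000) and random stand-in data rather than a real model:

```python
# Sketch of FFN -> linear projection -> softmax, with assumed sizes and random data.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000                 # illustrative sizes, not from the text
attended = torch.randn(1, 1, d_model)            # stand-in output of the attention sub-layers

# 1. Feed-forward network: two linear layers with a ReLU in between.
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
ffn_output = ffn(attended)

# 2. Linear layer: project the d_model-dimensional output to one logit per vocabulary token.
to_vocab = nn.Linear(d_model, vocab_size)
logits = to_vocab(ffn_output)                    # shape (1, 1, vocab_size)

# 3. Softmax layer: turn the logits into a probability distribution over the vocabulary.
probs = torch.softmax(logits, dim=-1)

# Token selection: the highest-probability token becomes the next token (greedy decoding).
next_token_id = probs.argmax(dim=-1)
```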
Iteration 2:
1. Masked Multi-Head Attention:
o The input sequence now includes the previously generated token: <s> I'm
2. Encoder-Decoder Attention:
o The decoder attends to the encoder's contextualized matrix of "Hi, how are you?" again to gather more relevant information.
Iteration 3:
o The input now includes the tokens generated so far (e.g., <s> I'm good,), and the same steps are repeated to predict the next token.
Final Iteration:
o The loop continues, one token per iteration, until the decoder produces an end-of-sequence token.
Final Output:
The final output sequence might be: "I'm good, how are you?"
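Putting the iterations together, the whole generation process can be written as a greedy decoding loop. In this sketch, `decoder`, `to_vocab`, `encoder_output`, and the special token IDs are hypothetical stand-ins, not names from the original text.

```python
# A sketch of the iterative generation loop; `decoder`, `to_vocab`, `encoder_output`,
# and the special token IDs are hypothetical stand-ins.
import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 20

def greedy_decode(decoder, to_vocab, encoder_output):
    # Iteration 1 starts from the start-of-sequence token <s>.
    generated = torch.tensor([[BOS_ID]])
    for _ in range(MAX_LEN):
        # Each iteration re-runs the decoder over everything generated so far,
        # attending to the encoder's contextualized matrix each time.
        hidden = decoder(generated, encoder_output)       # (1, seq_len, d_model)
        logits = to_vocab(hidden[:, -1, :])               # scores for the next token only
        next_id = logits.softmax(dim=-1).argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == EOS_ID:                      # stop at end-of-sequence
            break
    return generated
```

Greedy argmax selection matches the "token with the highest probability" rule described above; beam search or sampling would replace only that single line.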