Encoder and Decoder Diagram Explanation

The document outlines the process of a decoder in a neural network, detailing the steps involved in generating output sequences using masked multi-head attention, encoder-decoder attention, and feed-forward networks. It explains how masking prevents the model from accessing future tokens, ensuring realistic predictions, and describes the iterative process of token selection based on probability distributions. The final output is generated through repeated iterations until a stopping criterion is met.


"The cat sat on the mat.

"

Without Masking (Incorrect):

If the model has access to the entire sequence without any masking, it can "peek" at future words
while making predictions. This means it might see the word "mat" while predicting "sat," which
amounts to cheating and does not reflect inference, where future tokens have not been generated yet.
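
To make the mask concrete, here is a minimal sketch, assuming PyTorch and a toy word-level
tokenization (both are illustrative assumptions, not part of the original explanation). It builds the
causal mask for this sentence and shows that the row for "sat" blocks every later position,
including "mat".

```python
import torch

# Toy word-level "tokenization" of the example sentence.
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
seq_len = len(tokens)

# Causal mask: entry [i, j] is True when position j lies in the future of
# position i and must therefore be hidden from the attention computation.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Row for "sat" (index 2): "on", "the", "mat", "." are all blocked.
print(causal_mask[2])
# tensor([False, False, False,  True,  True,  True,  True])
```
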
1. Masked Multi-Head Attention:

o Self-Attention: The decoder first performs self-attention on its own output. This
allows the decoder to focus on different parts of the output sequence it has
generated so far.

o Masking: To prevent the decoder from "peeking" at future tokens in the output
sequence, a mask is applied. This ensures that the decoder only attends to previous
tokens.

o Multi-Head Attention: Multiple attention heads are used to capture different aspects
of the output sequence.

2. Encoder-Decoder Attention:

o Cross-Attention: The decoder then performs attention over the encoder's output.
This allows the decoder to align its output with the relevant parts of the input
sequence.

o Multi-Head Attention: Multiple attention heads are used to capture different
relationships between the input and output sequences.

3. Feed-Forward Network (FFN):

o Position-wise Feed-Forward Networks: Each position in the output sequence is fed
through a fully connected feed-forward network. This introduces non-linearity and
allows the model to learn complex relationships between the input and output.
4. Linear Layer:

o Projection: A linear layer is applied to project the output of the FFN into a vector
space that matches the size of the vocabulary.

5. Softmax:

o Probability Distribution: Softmax is applied to the output of the linear layer to
obtain a probability distribution over the vocabulary.

o Next Token Prediction: The token with the highest probability is selected as the next
token in the output sequence.
Each layer processes the output of the previous layer; a minimal sketch of a single decoder layer follows below.
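
The sketch below puts steps 1-5 into code, assuming PyTorch; the class name MiniDecoderLayer, the
dimensions, and the vocabulary size are illustrative assumptions. A real Transformer decoder also
adds residual connections, layer normalization, and a stack of such layers, with the linear
projection and softmax applied only after the final layer; everything is folded into one module
here so the five steps stay visible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, vocab_size=32000):
        super().__init__()
        # 1. Masked multi-head self-attention over the tokens generated so far.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2. Encoder-decoder (cross) attention over the encoder's output.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 3. Position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # 4. Linear projection from the model dimension to the vocabulary size.
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, dec_input, enc_output):
        seq_len = dec_input.size(1)
        # Causal mask so each position attends only to itself and earlier positions.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=dec_input.device), diagonal=1
        ).bool()
        x, _ = self.self_attn(dec_input, dec_input, dec_input, attn_mask=mask)
        # Queries come from the decoder, keys and values from the encoder output.
        x, _ = self.cross_attn(x, enc_output, enc_output)
        x = self.ffn(x)
        logits = self.to_vocab(x)
        # 5. Softmax turns the scores into a probability distribution over the vocabulary.
        return F.softmax(logits, dim=-1)

# Usage with random tensors standing in for real embeddings:
layer = MiniDecoderLayer()
probs = layer(torch.randn(1, 3, 512), torch.randn(1, 6, 512))
next_token_id = probs[0, -1].argmax()  # highest-probability next token
```
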

Iteration 1:

1. Self-Attention (Current Sequence):

o The decoder uses self-attention to focus on different parts of the generated
sequence (initially just <s>).

2. Encoder-Decoder Attention:

o The decoder uses cross-attention to attend to the encoder's output, which is the
contextualized matrix of "Hi, how are you?"

o It gathers relevant contextual information from the encoder's output, with each
decoder layer processing the output of the previous one.

3. Feed-Forward Network:

o The combined information from the attention mechanisms is processed through a
feed-forward neural network.

4. Softmax Layer:

o The output is passed through a softmax layer to generate a probability distribution
over the vocabulary for the next token.

5. Token Selection:

o The token with the highest probability (e.g., "I'm") is selected as the next token.
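
As a toy illustration of step 5, the snippet below picks the highest-probability token; the
vocabulary and the probability values are made up to show the mechanics, not produced by a real
model.

```python
import torch

vocab = ["<s>", "</s>", "I'm", "good", "how", "are", "you?"]
# Hypothetical softmax output for the position after <s>:
probs = torch.tensor([0.01, 0.02, 0.70, 0.10, 0.07, 0.05, 0.05])
next_token = vocab[int(probs.argmax())]  # -> "I'm"
```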

1. Feed-Forward Network:

 Role: After the self-attention and cross-attention mechanisms, the feed-forward network
(FFN) processes the outputs to transform the encoded information into the required
format.

 Function: It consists of two linear layers with a ReLU activation in between. This helps in
capturing complex patterns and relationships in the data.

2. Linear Layer:

 Role: The linear (or dense) layer acts as a transformation step. It maps the output of the
feed-forward network to the vocabulary size.

 Function: This layer projects the high-dimensional output of the FFN to the dimension of
the vocabulary, creating a vector where each position corresponds to a token in the
vocabulary.

3. Softmax Layer:

 Role: The softmax layer converts the output from the linear layer into a probability
distribution over the vocabulary.
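
A shape-focused sketch of these three components, assuming PyTorch; the dimensions
(d_model=512, d_ff=2048, vocab_size=32000) are illustrative choices, not taken from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, vocab_size = 512, 2048, 32000

# 1. FFN: two linear layers with a ReLU activation in between.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
# 2. Linear layer: projects to the vocabulary dimension.
to_vocab = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 4, d_model)     # decoder output for a 4-token sequence
h = ffn(x)                         # (1, 4, 512): same shape, transformed
logits = to_vocab(h)               # (1, 4, 32000): one score per vocabulary token
probs = F.softmax(logits, dim=-1)  # 3. Softmax: each row now sums to 1
```
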
Iteration 2:

9. Next Input Sequence:

o The input sequence now includes the previously generated token: <s> I'm

10. Self-Attention (Current Sequence):

o The decoder focuses on the current sequence <s> I'm.

11. Encoder-Decoder Attention:

o It attends to the encoder's contextualized matrix of "Hi, how are you?" again to
gather more relevant information.

12. Feed-Forward Network:

o The output is processed through the feed-forward network.

13. Softmax Layer:

o A probability distribution is generated for the next token.

14. Token Selection:

o The token with the highest probability (e.g., "good") is selected.

Iteration 3:

15. Next Input Sequence:

o The input sequence is now: <s> I'm good.

16. Self-Attention (Current Sequence):

o The decoder focuses on the sequence <s> I'm good.

17. Encoder-Decoder Attention:

o It attends to the encoder's output again.

18. Feed-Forward Network:

o The output is processed.

19. Softmax Layer:

o A probability distribution is generated.

20. Token Selection:

o The token with the highest probability (e.g., "how") is selected.

Final Iteration:

21. Repeat Steps 9-20:

o The process repeats, generating tokens like "are", "you?" until a stopping criterion is
met (e.g., the end token </s>).

Final Output:

 The final output sequence might be: "I'm good, how are you?"
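
The whole iterative procedure can be summarized as a greedy decoding loop. The sketch below assumes
PyTorch, a hypothetical decoder(input_ids, enc_output) callable that returns next-token
probabilities, and illustrative ids for <s> and </s>.

```python
import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 50  # illustrative values

def greedy_decode(decoder, enc_output):
    generated = [BOS_ID]                        # start with <s>
    for _ in range(MAX_LEN):
        input_ids = torch.tensor([generated])   # current sequence, e.g. <s> I'm good
        probs = decoder(input_ids, enc_output)  # distribution over the vocabulary
        next_id = int(probs[0, -1].argmax())    # pick the highest-probability token
        if next_id == EOS_ID:                   # stop at the end token </s>
            break
        generated.append(next_id)               # feed it back in the next iteration
    return generated[1:]                        # drop <s>
```

In practice, beam search or sampling is often used instead of taking the argmax at every step, but
the iteration structure stays the same.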
