Encoder and decoder diagram explanation
If the model has access to the entire sequence without any masking, it can "peek" at future words
while making predictions. This means it might see the word "mat" while predicting "sat," which
would be cheating and not representative of real-world usage.
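As a small illustration (not part of the original diagram), the snippet below builds the causal mask for a toy four-token sequence and marks which future positions each token is forbidden to attend to; PyTorch is used here purely for convenience.

```python
# Illustrative causal mask for a toy target sequence; True marks future positions
# that masked self-attention is NOT allowed to look at.
import torch

tokens = ["the", "cat", "sat", "mat"]  # toy example echoing the "sat"/"mat" case above
n = len(tokens)
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
print(mask)
# tensor([[False,  True,  True,  True],   # "the" sees only itself
#         [False, False,  True,  True],   # "cat" sees "the", "cat"
#         [False, False, False,  True],   # "sat" cannot see "mat" -- no cheating
#         [False, False, False, False]])  # "mat" sees the whole prefix
```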
1. Masked Multi-Head Attention:
o Self-Attention: The decoder first performs self-attention on its own output. This
allows the decoder to focus on different parts of the output sequence it has
generated so far.
o Masking: To prevent the decoder from "peeking" at future tokens in the output
sequence, a mask is applied. This ensures that the decoder only attends to previous
tokens.
o Multi-Head Attention: Multiple attention heads are used to capture different aspects
of the output sequence.
2. Encoder-Decoder Attention:
o Cross-Attention: The decoder then performs attention over the encoder's output. This allows the decoder to align its output with the relevant parts of the input sequence.
3. Feed-Forward Network:
o Transformation: A position-wise feed-forward network (two linear layers with a ReLU in between) further transforms each position's representation.
4. Linear Layer:
o Projection: A linear layer is applied to project the output of the FFN into a vector space that matches the size of the vocabulary.
5. Softmax:
o Probability Distribution: The softmax converts the projected vector into a probability distribution over the vocabulary.
o Next Token Prediction: The token with the highest probability is selected as the next token in the output sequence.
Each decoder layer processes the output of the previous layer; a minimal sketch of one such layer follows.
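To make the flow above concrete, here is an illustrative sketch of one decoder layer in PyTorch. It is a sketch under assumptions, not the exact implementation behind the diagram: the sizes (d_model, n_heads, d_ff) are arbitrary, and the residual connections and layer normalization found in real transformer layers are omitted for brevity.

```python
# A minimal sketch of one decoder layer, assuming PyTorch's nn.MultiheadAttention.
# Sizes are illustrative; residual connections and layer norm are omitted for brevity.
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward network: two linear layers with a ReLU in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, decoder_input, encoder_output):
        seq_len = decoder_input.size(1)
        # Causal mask: True marks future positions the decoder may not attend to.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        # 1. Masked multi-head self-attention over the tokens generated so far.
        x, _ = self.self_attn(decoder_input, decoder_input, decoder_input,
                              attn_mask=causal_mask)
        # 2. Encoder-decoder (cross) attention: queries from the decoder,
        #    keys and values from the encoder's contextualized output.
        x, _ = self.cross_attn(x, encoder_output, encoder_output)
        # 3. Position-wise feed-forward network.
        return self.ffn(x)
```

Stacking several such layers, each consuming the previous layer's output, gives the full decoder described above.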
Iteration 1:
1. Masked Multi-Head Attention:
o The decoder starts from the start-of-sequence token <s> and performs masked self-attention over the tokens generated so far (only <s> at this point).
2. Encoder-Decoder Attention:
o The decoder uses cross-attention to attend to the encoder's output, which is the
contextualized matrix of "Hi, how are you?"
o It gathers relevant contextual information from the encoder's output.
3. Feed-Forward Network:
o The FFN further processes the attended representation, as described in the component details below.
4. Softmax Layer:
o After the linear projection to vocabulary size, the softmax produces a probability distribution over all tokens.
5. Token Selection:
o The token with the highest probability (e.g., "I'm") is selected as the next token.
1. Feed-Forward Network:
Role: After the self-attention and cross-attention mechanisms, the feed-forward network (FFN) processes the attention outputs position by position, producing the representation that is passed on to the next layer (or, at the top layer, to the linear projection).
Function: It consists of two linear layers with a ReLU activation in between. This helps capture complex patterns and relationships in the data.
2. Linear Layer:
Role: The linear (or dense) layer acts as a transformation step. It maps the output of the
feed-forward network to the vocabulary size.
Function: This layer projects the high-dimensional output of the FFN to the dimension of
the vocabulary, creating a vector where each position corresponds to a token in the
vocabulary.
3. Softmax Layer:
Role: The softmax layer converts the output from the linear layer into a probability
distribution over the vocabulary.
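A sketch of these three components, using assumed sizes (d_model = 512, vocab_size = 32000) and random stand-in data rather than a real model:

```python
# Sketch of FFN -> linear projection -> softmax, with assumed sizes and random data.
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000                 # illustrative sizes, not from the text
attended = torch.randn(1, 1, d_model)            # stand-in output of the attention sub-layers

# 1. Feed-forward network: two linear layers with a ReLU in between.
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
ffn_output = ffn(attended)

# 2. Linear layer: project the d_model-dimensional output to one logit per vocabulary token.
to_vocab = nn.Linear(d_model, vocab_size)
logits = to_vocab(ffn_output)                    # shape (1, 1, vocab_size)

# 3. Softmax layer: turn the logits into a probability distribution over the vocabulary.
probs = torch.softmax(logits, dim=-1)

# Token selection: the highest-probability token becomes the next token (greedy decoding).
next_token_id = probs.argmax(dim=-1)
```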
Iteration 2:
1. Masked Multi-Head Attention:
o The input sequence now includes the previously generated token: <s> I'm
2. Encoder-Decoder Attention:
o The decoder attends to the encoder's contextualized matrix of "Hi, how are you?" again to gather more relevant information.
Iteration 3:
o The input now includes the tokens generated so far (e.g., <s> I'm good,), and the same steps are repeated to predict the next token.
Final Iteration:
o The loop continues, one token per iteration, until the decoder produces an end-of-sequence token.
Final Output:
The final output sequence might be: "I'm good, how are you?"
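Putting the iterations together, the whole generation process can be written as a greedy decoding loop. In this sketch, `decoder`, `to_vocab`, `encoder_output`, and the special token IDs are hypothetical stand-ins, not names from the original text.

```python
# A sketch of the iterative generation loop; `decoder`, `to_vocab`, `encoder_output`,
# and the special token IDs are hypothetical stand-ins.
import torch

BOS_ID, EOS_ID, MAX_LEN = 1, 2, 20

def greedy_decode(decoder, to_vocab, encoder_output):
    # Iteration 1 starts from the start-of-sequence token <s>.
    generated = torch.tensor([[BOS_ID]])
    for _ in range(MAX_LEN):
        # Each iteration re-runs the decoder over everything generated so far,
        # attending to the encoder's contextualized matrix each time.
        hidden = decoder(generated, encoder_output)       # (1, seq_len, d_model)
        logits = to_vocab(hidden[:, -1, :])               # scores for the next token only
        next_id = logits.softmax(dim=-1).argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == EOS_ID:                      # stop at end-of-sequence
            break
    return generated
```

Greedy argmax selection matches the "token with the highest probability" rule described above; beam search or sampling would replace only that single line.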