Transformer
1. Introduction to Transformers
• Before Transformers, models like RNNs (Recurrent Neural Networks) and LSTMs
(Long Short-Term Memory) were used for sequence-based tasks like language
translation. However, these models processed data sequentially, which made training
slow and prone to the vanishing gradient problem (where gradients become too
small, preventing the model from learning over long sequences).
Example: If you're trying to translate a long sentence, early words may lose their influence over time. For instance, in "The cat sitting by the window is...", the early subject "The cat" may no longer correctly influence the words generated later because of this issue.
Transformers were introduced by the paper "Attention is All You Need" (Vaswani et al.,
2017), which allowed parallel processing of sequences and solved the vanishing gradient
problem using attention mechanisms.
2. Key Concepts
Sequence-to-Sequence Models:
• Earlier Models (RNN, LSTM, GRU): These models work well for short sequences
but struggle with long-range dependencies due to their sequential nature.
Example: In a translation task, translating a sentence like "The quick brown fox
jumps over the lazy dog" requires the model to remember the subject "The quick
brown fox" when predicting the verb "jumps," which becomes harder with longer
sentences.
3. Transformer Architecture
High-Level Overview: A Transformer consists of an encoder and a decoder, each built from stacked layers that combine attention and feed-forward sub-layers.
Components of a Transformer:
• Positional Encoding: Since Transformers process all tokens in parallel rather than in order, they need information about the position of each word. Positional encoding adds this position information to the word embeddings (see the code sketch after this list).
Example: In the sentence "The cat sits," "cat" comes after "The." Positional encodings ensure this order is preserved.
• Multi-Head Attention: Several attention heads run in parallel, each free to focus on a different kind of relationship between words.
Example: In a translation task, one attention head might focus on verb tenses while another focuses on nouns. So, while translating "She is running fast," the model pays attention to "running" when producing the correct tense.
• Feed-Forward Networks (FFN): After attention layers, the data passes through
dense (feed-forward) layers to perform further transformations.
• Layer Normalization and Residual Connections: These techniques make training easier and faster by keeping activations well scaled and letting gradients flow through deep stacks of layers, which helps avoid vanishing gradients.
• Softmax: Converts raw model outputs into probabilities over the vocabulary for each
word position. This helps in predicting the next word.
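As referenced in the Positional Encoding bullet above, here is a minimal sketch of the sinusoidal positional encoding described in "Attention Is All You Need," written in PyTorch (the function name and the small dimensions are illustrative only, not part of any library):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # Each position gets a vector of sines and cosines at different frequencies
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

# Encodings for the 3 positions of "The cat sits" with a model dimension of 8
print(sinusoidal_positional_encoding(3, 8))

Because every position receives a unique pattern, the model can tell that "cat" comes after "The" even though all words are processed in parallel.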
4. Self-Attention Mechanism
Each word is mapped to a Query (Q), a Key (K), and a Value (V) vector; the Query of the current word is compared against the Keys of all words to decide how much attention to give each of them.
Example: In the sentence "The cat sat on the mat," if the current word is "cat," the Query is "cat." The Keys are the other words in the sequence. Based on the Query "cat," the model looks at other words (Keys) to understand that "sat" and "on the mat" are important for context.
Self-Attention Formula:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
This formula calculates how much focus each word (Query) should give to the other words
(Keys).
Example: In "The quick brown fox jumps over the lazy dog," the word "jumps" (Query)
might give the most attention to "fox" (Key), as it needs to understand the subject.
1. Raw Attention Scores: Each Query is compared with every Key via a dot product (QK^T). These scores can be positive or negative and have a wide range of values.
2. Softmax for Normalization: To make the attention scores interpretable and ensure they are in the form of probabilities, the softmax function is applied. Here, d_k is the dimension of the key vectors, and dividing by √d_k prevents the dot products from growing too large.
Why Softmax?
• Normalization: Softmax ensures that the attention scores sum up to 1, making them
easy to interpret as probabilities.
• Focus: By turning attention scores into probabilities, softmax allows the model to
focus more on relevant tokens while suppressing the importance of less relevant ones.
In short, the softmax function plays a critical role in transforming raw attention scores into probabilities that guide how much attention should be paid to each token in the sequence.
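To make the formula concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (the shapes are illustrative: a single sequence of 5 tokens with d_k = 4):

import torch
import torch.nn.functional as F

d_k = 4
Q = torch.randn(5, d_k)  # one Query vector per token
K = torch.randn(5, d_k)  # one Key vector per token
V = torch.randn(5, d_k)  # one Value vector per token

scores = Q @ K.T / d_k ** 0.5        # raw scores, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)  # each row now sums to 1
output = weights @ V                 # weighted sum of the Value vectors
print(weights.sum(dim=-1))           # tensor([1., 1., 1., 1., 1.])

Each row of weights says how much attention one word pays to every other word, exactly as described above.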
5. Hands-On Example: Implementing Transformer for Text Processing
In a basic example, we might use PyTorch to build a simple Transformer. Here, we're
focusing on understanding how each component works together.
python
import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, d_model, nhead, num_layers):
        super(SimpleTransformer, self).__init__()
        # Token embeddings for a vocabulary of 1,000 tokens
        self.embedding = nn.Embedding(1000, d_model)
        # Learnable positional encodings for sequences up to 100 tokens
        self.positional_encoding = nn.Parameter(torch.randn(1, 100, d_model))
        # Encoder layers built around multi-head self-attention
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        # Project the encoded representation to a single output value
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq_len) tensor of token indices
        x = self.embedding(x) + self.positional_encoding[:, :x.size(1), :]
        x = self.transformer(x)
        # Use the first token's representation for the prediction
        return self.fc(x[:, 0, :])
This example demonstrates how to build a simple Transformer architecture using multi-head
attention.
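A quick way to sanity-check the model is to run it on a batch of random token IDs (a minimal sketch; the vocabulary size of 1,000 and the maximum length of 100 match the values hard-coded above):

model = SimpleTransformer(d_model=64, nhead=4, num_layers=2)
tokens = torch.randint(0, 1000, (4, 20))  # batch of 4 sequences, 20 tokens each
output = model(tokens)
print(output.shape)  # torch.Size([4, 1])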
6. Applications of Transformers
• Text Summarization: Condensing a long document into a short summary.
Example: Summarizing a news article about the stock market into a few key sentences.
7. Transformer Variants
• BERT (Bidirectional Encoder Representations from Transformers): BERT reads a sentence in both directions at once, so each word is understood from its full surrounding context.
Example: In the sentence "He went to the bank to deposit money," BERT uses both the words "deposit" and "money" to understand that "bank" refers to a financial institution.
• GPT (Generative Pre-trained Transformer): GPT is used for generating text and completing sentences. It's used in chatbots, writing assistants, and more.
Example: If you type "The future of AI is," GPT might generate: "The future of AI is
filled with potential breakthroughs in healthcare, automation, and space exploration."
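This kind of text generation can be tried with a small pre-trained GPT-2 model through the Hugging Face transformers library (a sketch, assuming the library is installed; the generated text will vary from run to run):

from transformers import pipeline

# Load a small pre-trained GPT-style model for text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])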
8. Challenges and Limitations
• High Computation Costs: Training large models like GPT-3 requires significant computational resources and energy.
Example: Training GPT-3 required the use of thousands of GPUs over weeks,
making it inaccessible for smaller companies or individual researchers.
• Data Bias: Transformers can learn and propagate biases from the training data.
9. Conclusion