
Generative Pre-trained Transformer 2 From Scratch
The purpose of this notebook is to guide you through the process of building a Generative Pre-trained
Transformer 2 (GPT-2) model from scratch. GPT-2 is a language generation model developed by OpenAI that
was trained on a large corpus of text data and can generate coherent and contextually relevant text.

The impact of this model has been significant, as it has demonstrated the ability to generate human-like text
and perform well on a variety of natural language processing tasks. In this notebook, we will explore the
architecture of the GPT-2 model, train it on a text dataset, and evaluate its performance on a text generation
task.

GPT-2 uses the newer decoder-only variant of the Transformer architecture, which is designed for language
modeling tasks. The Transformer architecture has been widely adopted in the field of natural language
processing due to its ability to capture long-range dependencies in text data and its parallelizable nature.

By contrast, the original Transformer architecture consists of an encoder and a decoder: the encoder
processes the input sequence and generates a sequence of hidden states, while the decoder generates the
output sequence based on the encoder's hidden states and the previous output tokens.
The notebook will cover the following topics:

Overview of the Transformer architecture: Understand the key components of the Transformer architecture,
including self-attention mechanisms and feedforward neural networks.

Data preparation: Learn how to prepare and preprocess the text data for training the GPT-2 model.

Training: Train the GPT-2 model using the preprocessed text data and optimize its parameters to minimize a
suitable loss function.

Inference: Generate text using the trained GPT-2 model and evaluate its performance on a text generation
task.
References:

Attention Is All You Need (Transformer) by Umar Jamil
OpenAI GPT-2 Paper
The Illustrated Transformer
The Annotated Transformer

In [ ]: import math
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import DataLoader, Dataset


from transformers import AutoTokenizer

1. Input Embeddings
In [ ]: class InputEmbedding(nn.Module):
    def __init__(self, embed_dim: int, vocab_size: int):
        """
        Initialize the InputEmbedding module.

        Args:
            embed_dim (int): The dimensionality of the input embedding.
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        # Store the dimensionality and vocabulary size
        self.embed_dim = embed_dim
        self.vocab_size = vocab_size

        # Create an embedding layer that maps the vocabulary to an embed_dim-dimensional space
        # The embedding weight matrix has shape (vocab_size, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        """
        Perform the forward pass of the InputEmbedding module.

        Args:
            x (tensor): The input tensor of token ids.

        Returns:
            tensor: The embedded input tensor, scaled by the square root of the embedding dimension.
        """
        # Embed the input tensor using the embedding layer
        # Shape: (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)
        embedded_input = self.embedding(x)
        # Scale the embedded input tensor by the square root of the dimensionality
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        scaled_embedded_input = embedded_input * math.sqrt(self.embed_dim)
        return scaled_embedded_input
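As a quick shape check (the sizes below are arbitrary, illustrative values), a cell like the following can be run:

In [ ]: emb = InputEmbedding(embed_dim=16, vocab_size=100)
tokens = torch.randint(0, 100, (2, 5))   # (batch_size=2, seq_len=5)
print(emb(tokens).shape)                 # torch.Size([2, 5, 16])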
2. Positional Encoding
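The module below implements the fixed sinusoidal encodings from the original Transformer paper, where pos is the token position and i indexes the embedding dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$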
In [ ]: class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim: int = 512, max_seq_len: int = 100, dropout: float = 0.1):
        """Initialize the PositionalEncoding module."""
        super().__init__()
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len
        self.dropout = nn.Dropout(dropout)
        # Precompute the positional encoding matrix
        self.positional_encoding = self._precompute_positional_encoding(max_seq_len, embed_dim)

    def _precompute_positional_encoding(self, max_seq_len, embed_dim):
        """Precompute the positional encoding matrix."""
        with torch.no_grad():
            # Create a positional encoding matrix of shape (max_seq_len, embed_dim)
            positional_encoding = torch.zeros(max_seq_len, embed_dim)
            # Create a tensor 'position' with values [0, 1, 2, ..., max_seq_len - 1], shape (max_seq_len, 1)
            position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
            # Compute the division term 1 / 10000^(2i / embed_dim) for the even dimensions
            division_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
            # Apply sine to the even indices and cosine to the odd indices
            positional_encoding[:, 0::2] = torch.sin(position * division_term)
            positional_encoding[:, 1::2] = torch.cos(position * division_term)
            # Shape (max_seq_len, embed_dim) -> (1, max_seq_len, embed_dim)
            positional_encoding = positional_encoding.unsqueeze(0)

        return positional_encoding

    def forward(self, x):
        """Perform the forward pass of the PositionalEncoding module."""
        # Add the positional encoding matrix to the input tensor
        x = x + self.positional_encoding[:, : x.size(1)].to(x.device)
        # Apply dropout to the input tensor
        x = self.dropout(x)
        return x

3. Layer Normalization
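The layer normalizes each position's feature vector over the embedding dimension and then applies a learnable gain and bias, exactly as in the code below:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma + \epsilon}\,\gamma + \beta$$

where $\mu$ and $\sigma$ are the mean and standard deviation over the embedding dimension, and $\gamma$ (gain) and $\beta$ (bias) are learnable parameters.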
In [ ]: class LayerNormalization(nn.Module):
    def __init__(self, embed_dim: int, eps: float = 1e-6):
        """Initialize the LayerNormalization module."""
        super().__init__()
        self.eps = eps
        # Create two learnable parameters to scale and shift the normalized input
        self.gain = nn.Parameter(torch.Tensor(embed_dim).uniform_())  # Initialize with values sampled from a uniform distribution
        self.bias = nn.Parameter(torch.Tensor(embed_dim).normal_())   # Initialize with values sampled from a normal distribution

    def forward(self, x):
        """Perform the forward pass of the LayerNormalization module."""
        # Compute the mean and standard deviation of the input tensor over the embedding dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Zero-center by subtracting the mean from the input tensor
        # Normalize the scale by dividing by the standard deviation (epsilon is added for numerical stability)
        # Scale and shift the normalized input using the learnable parameters
        return (x - mean) / (std + self.eps) * self.gain + self.bias

4. Feed Forward Block
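The block implements the position-wise feed-forward network from the Transformer paper, which the comments in the code refer to as "the formula":

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$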


In [ ]: class FeedForwardBlock(nn.Module):
    def __init__(self, embed_dim: int, intermediate_size: int, dropout: float = 0.1):
        """Initialize the FeedForwardBlock module.

        embed_dim is the hidden size of the transformer model and serves as both input and output size.
        intermediate_size is the hidden size of the intermediate layer in the FeedForwardBlock.
        dropout is the dropout probability.
        """
        super().__init__()
        # embed_dim is the dimensionality of the input and output of the FeedForwardBlock
        # intermediate_size is the dimensionality of the intermediate layer in the FeedForwardBlock
        self.fc1 = nn.Linear(embed_dim, intermediate_size)  # W1 and B1 in the formula
        self.fc2 = nn.Linear(intermediate_size, embed_dim)  # W2 and B2 in the formula
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """Perform the forward pass of the FeedForwardBlock module."""
        # (Batch, Seq_len, embed_dim) -> (Batch, Seq_len, intermediate_size) -> (Batch, Seq_len, embed_dim)
        x_intermediate = self.dropout(F.relu(self.fc1(x)))
        x_output = self.fc2(x_intermediate)
        return x_output

5. Multi-Head Attention Block
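Each attention head computes scaled dot-product attention over the queries, keys, and values; multi-head attention runs several such heads in parallel and concatenates their outputs:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$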



In [ ]: def generate_square_subsequent_mask(size: int, device: torch.device = "cpu"):
    """Generate a square causal mask for the sequence."""
    mask = torch.tril(torch.ones(size, size, dtype=torch.bool, device=device), diagonal=0)
    # Convert the boolean mask into a 0/1 integer mask
    mask = mask.long()
    return mask.unsqueeze(0)  # Add batch dimension: shape (1, size, size)
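For example, a size-4 mask is a lower-triangular matrix of ones, so position i can only attend to positions j <= i:

In [ ]: print(generate_square_subsequent_mask(4))
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]])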

In [ ]: class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8, attn_dropout: float = 0.1, ff_dropout: float = 0.1, max_len: int = 512):
        super().__init__()
        self.num_heads = num_heads
        assert embed_dim % self.num_heads == 0, "invalid heads and embedding dimension configuration"
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.proj_dropout = nn.Dropout(ff_dropout)
        # Create a buffer to store a causal mask with no grad
        # (the forward pass below uses the mask passed in as an argument)
        # Shape: (max_len, max_len)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        )

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()
        # Apply linear transformations to the input tensor,
        # then split the result into num_heads and head_dim
        # and transpose into the correct order.
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, num_heads, head_dim)
        #        -> (batch_size, num_heads, seq_len, head_dim)
        q = self.query(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        k = self.key(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        v = self.value(x).view(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)

        # Compute attention scores using einsum
        # b: batch size, h: num_heads, i: seq_len (queries), j: seq_len (keys), d: head_dim
        # Multiply query and key tensors element-wise and sum along the shared dimension (head_dim),
        # then divide by the square root of the dimension of the query/key vectors.
        # Equivalent to: attention = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        # Shape: (batch_size, num_heads, seq_len, head_dim) x (batch_size, num_heads, seq_len, head_dim)
        #        -> (batch_size, num_heads, seq_len, seq_len)
        attention = torch.einsum('bhid,bhjd->bhij', q, k) / math.sqrt(q.size(-1))

        # Apply the causal mask if provided
        if mask is not None:
            attention = attention.masked_fill(mask == 0, float("-inf"))

        # Apply softmax and dropout
        # Shape: (batch_size, num_heads, seq_len, seq_len) -> (batch_size, num_heads, seq_len, seq_len)
        attention = self.attn_dropout(F.softmax(attention, dim=-1))

        # Compute the weighted sum of values using the attention scores
        # Equivalent to: torch.matmul(attention, v)
        # Shape: (batch_size, num_heads, seq_len, seq_len) x (batch_size, num_heads, seq_len, head_dim)
        #        -> (batch_size, num_heads, seq_len, head_dim)
        y = torch.einsum('bhij,bhjd->bhid', attention, v)

        # Merge num_heads and head_dim back into embed_dim:
        # transpose sequence length and num_heads, make the tensor contiguous,
        # then reshape based on batch size, sequence length and embed_dim.
        # Shape: (batch_size, num_heads, seq_len, head_dim) -> (batch_size, seq_len, num_heads, head_dim)
        #        -> (batch_size, seq_len, num_heads * head_dim) = (batch_size, seq_len, embed_dim)
        y = y.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)

        # Apply the output projection and dropout
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        return self.proj_dropout(self.proj(y))
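A quick shape check with a causal mask (the tensor sizes here are illustrative, using the helper defined above) might look like:

In [ ]: mha = MultiHeadAttention(embed_dim=512, num_heads=8, max_len=512)
x = torch.randn(2, 10, 512)                        # (batch_size, seq_len, embed_dim)
causal_mask = generate_square_subsequent_mask(10)  # (1, 10, 10), broadcasts over batch and heads
print(mha(x, mask=causal_mask).shape)              # torch.Size([2, 10, 512])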

6. Residual Connection
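Note that, like GPT-2, this implementation applies layer normalization before the sublayer (pre-norm) rather than after it, so each residual connection computes:

$$\text{output} = x + \text{Dropout}\big(\text{Sublayer}(\text{LayerNorm}(x))\big)$$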
In [ ]: class ResidualConnection(nn.Module):
    def __init__(self, embed_dim, dropout: float = 0.1):
        """Initialize the ResidualConnection module."""
        super().__init__()
        self.layer_norm = LayerNormalization(embed_dim=embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        """Perform the forward pass of the ResidualConnection module."""
        # Apply layer normalization
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        normalized_x = self.layer_norm(x)
        # Apply the sublayer (e.g. attention or feedforward block)
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        sublayer_output = sublayer(normalized_x)
        # Add the residual connection and apply dropout
        # (batch_size, seq_len, embed_dim) + (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        residual_output = x + self.dropout(sublayer_output)
        return residual_output

7. Projection Head
In [ ]: class ProjectionHead(nn.Module):
    def __init__(self, embed_dim: int, vocab_size: int):
        """Initialize the ProjectionHead module."""
        super().__init__()
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        """Perform the forward pass of the ProjectionHead module."""
        # Apply a linear transformation to project the hidden states onto the vocabulary
        # (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, vocab_size)
        return self.fc(x)

8. Transformer Block
In [ ]: class DecoderBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int = 512,
        num_heads: int = 8,
        ff_dim: int = 2048,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1,
        dropout: float = 0.1,
        max_len: int = 512,
    ):
        super().__init__()
        # Initialize the multi-head self-attention mechanism
        self.MultiHeadAttention = MultiHeadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            attn_dropout=attn_dropout,
            ff_dropout=ff_dropout,
            max_len=max_len,
        )
        # Initialize the feed-forward block
        self.feed_forward = FeedForwardBlock(
            embed_dim=embed_dim,
            intermediate_size=ff_dim,
            dropout=ff_dropout,
        )
        # Initialize the residual connections
        self.residual_connection1 = ResidualConnection(embed_dim=embed_dim, dropout=dropout)
        self.residual_connection2 = ResidualConnection(embed_dim=embed_dim, dropout=dropout)

    def forward(self, x, attention_mask=None):
        # Apply the self-attention mechanism with a residual connection
        x_with_attention = self.residual_connection1(x, lambda x: self.MultiHeadAttention(x, mask=attention_mask))
        # Apply the feed-forward block with a residual connection
        x_with_ff = self.residual_connection2(x_with_attention, self.feed_forward)
        return x_with_ff

9. Building the Transformer


In [ ]: class GPT(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 512,
        max_len: int = 512,
        embed_dropout: float = 0.1,
        num_blocks: int = 6,
        num_heads: int = 8,
        ff_dim: int = 2048,
        attn_dropout: float = 0.1,
        ff_dropout: float = 0.1
    ):
        super().__init__()
        self.max_len = max_len
        self.token_embedding = InputEmbedding(
            embed_dim=embed_dim,
            vocab_size=vocab_size,
        )
        self.positional_embedding = PositionalEncoding(
            embed_dim=embed_dim,
            max_seq_len=max_len,
            dropout=embed_dropout,
        )
        self.blocks = nn.ModuleList([DecoderBlock(
            embed_dim=embed_dim,
            num_heads=num_heads,
            ff_dim=ff_dim,
            attn_dropout=attn_dropout,
            ff_dropout=ff_dropout,
            max_len=max_len,
        ) for _ in range(num_blocks)])

        self.projection_head = ProjectionHead(embed_dim=embed_dim, vocab_size=vocab_size)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor = None):
        # Get the sequence length from the input
        # Shape of input_ids: (batch_size, seq_len)
        seq_len = input_ids.size(1)
        assert seq_len <= self.max_len, "Sequence longer than model capacity"

        # Token embedding
        # Shape: (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)
        x = self.token_embedding(input_ids)

        # Add the positional embedding
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        x = self.positional_embedding(x)

        # Forward through the decoder blocks;
        # the output of each block is the hidden state of the transformer
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, embed_dim)
        for block in self.blocks:
            x = block(x, attention_mask=attention_mask)

        # Linear layer for the output logits
        # Shape: (batch_size, seq_len, embed_dim) -> (batch_size, seq_len, vocab_size)
        x = self.projection_head(x)

        return x

10. Sample Usage


In [ ]: # Define model parameters
vocab_size = 50257 # Example vocab size; specific to GPT2 tokenizer
embed_dim = 768
max_len = 1024 # This can be adjusted based on the use case
embed_dropout = 0.1
num_blocks = 6 # This can be adjusted based on the use case
num_heads = 8 # This can be adjusted based on the use case
ff_dim = 2048 # This can be adjusted based on the use case
attn_dropout = 0.1
ff_dropout = 0.1

# Initialize GPT model


model = GPT(
vocab_size=vocab_size,
embed_dim=embed_dim,
max_len=max_len,
embed_dropout=embed_dropout,
num_blocks=num_blocks,
num_heads=num_heads,
ff_dim=ff_dim,
attn_dropout=attn_dropout,
ff_dropout=ff_dropout
)
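
As a quick sanity check (the exact count depends on the hyperparameters chosen above), the total number of trainable parameters can be inspected with:

In [ ]: num_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {num_params:,}")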
11. Training the Transformer

11.1 Data Preprocessing


In [ ]: sample_data = [
"Mary had a little lamb",
"Its fleece was white as snow",
"And everywhere that Mary went",
"The lamb was sure to go",
]

In [ ]: class GPTDataset(Dataset):
    def __init__(self, data: list, tokenizer, max_length: int):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.end_token = tokenizer.eos_token_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data[idx]
        input_txt = self.tokenizer(text, truncation=True, return_tensors="pt")["input_ids"].squeeze(0)
        text_len = input_txt.size(0)
        if text_len < self.max_length:
            # Pad shorter sequences with the end-of-sequence token;
            # the label is the input shifted one position to the left
            padding_len = self.max_length - text_len
            padding = torch.tensor([self.end_token] * padding_len)
            input_ids = torch.cat((input_txt, padding), dim=0)
            label = torch.cat((input_txt[1:], torch.tensor([self.end_token]), padding), dim=0)
        else:
            # Truncate longer sequences to max_length
            input_ids = input_txt[:self.max_length]
            label = torch.cat((input_txt[1:self.max_length], torch.tensor([self.end_token])), dim=0)
        return input_ids, label

In [ ]: tokenizer = AutoTokenizer.from_pretrained("gpt2")

train_dataset = GPTDataset(
data = sample_data,
tokenizer = tokenizer,
max_length = 200,
)

In [ ]: input_ids, label = train_dataset[2]


input_ids = input_ids.unsqueeze(0)
label = label.unsqueeze(0)

print("Label:", label)
print("Input IDs:", input_ids)

print("Label Shape:", label.shape)


print("Input IDs Shape:", input_ids.shape)

11.2 Model Training


In [ ]: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lr = 5e-5
batch_size = 2
num_epochs = 5

In [ ]: model.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for batch in train_loader:
        optimizer.zero_grad()
        # Unpack input and label from the batch and send them to the device
        input_ids, labels = batch
        input_ids, labels = input_ids.to(device), labels.to(device)

        # Generate the causal mask
        # Shape: (1, seq_len, seq_len); it broadcasts over the batch and head dimensions
        mask = generate_square_subsequent_mask(input_ids.size(1), device=device)

        # Forward pass
        logits = model(input_ids=input_ids, attention_mask=mask)

        # Flatten the logits and labels for computing the loss
        logits_flat = logits.view(-1, logits.size(-1))
        labels_flat = labels.view(-1)

        # Compute the loss
        loss = criterion(logits_flat, labels_flat)

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader)}')

12. Inference
In [ ]: vocab_size = 50257
embed_dim = 768
max_len = 1024
embed_dropout = 0.1
num_blocks = 12 # GPT-2 small; larger GPT-2 variants use 24, 36, or 48 blocks
num_heads = 12 # GPT-2 small; larger GPT-2 variants use 16, 20, or 25 heads
ff_dim = 3072
attn_dropout = 0.1
ff_dropout = 0.1

# Initialize GPT model


model = GPT(
vocab_size=vocab_size,
embed_dim=embed_dim,
max_len=max_len,
embed_dropout=embed_dropout,
num_blocks=num_blocks,
num_heads=num_heads,
ff_dim=ff_dim,
attn_dropout=attn_dropout,
ff_dropout=ff_dropout
)

In [ ]: model_name = "gpt2"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [ ]: input_txt = "Machine Learning with PyTorch can do amazing"

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)


print(input_ids)
print(input_ids.shape)

In [ ]: model = model.to(device)
iterations = []
n_steps = 10
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)

        # Select the logits of the first batch and the last token, and apply softmax to get the probabilities
        next_token_logits = output[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

        # Store the tokens with the highest probabilities in our little table
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        iterations.append(iteration)

        # Append the predicted next token to the input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)

sample_inference = pd.DataFrame(iterations)
sample_inference.head()

In [ ]: def generate_text_until_end(
    input_text: str,
    model: GPT,
    tokenizer: AutoTokenizer,
    max_length: int = 100,
    device='cpu',
):
    model = model.to(device)
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    end_token_id = tokenizer.eos_token_id
    generated_ids = input_ids.flatten().clone()  # Convert to a 1-dimensional tensor

    with torch.no_grad():
        while True:
            output = model(input_ids=input_ids)
            next_token_logits = output[:, -1, :]
            # Applying softmax here is not strictly necessary, because the argmax
            # of the logits is unchanged by softmax:
            # next_token_probs = torch.softmax(next_token_logits, dim=-1)
            next_token_id = torch.argmax(next_token_logits, dim=-1)
            generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)
            # Feed the full generated sequence back in; the model has no key/value cache,
            # so it needs the whole context at every step
            input_ids = generated_ids.unsqueeze(0)

            if next_token_id == end_token_id or len(generated_ids) >= max_length:
                break

    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

    return generated_text

In [ ]: # Example usage:
generated_text = generate_text_until_end(
input_text="I like to eat",
model=model,
tokenizer=tokenizer,
max_length=20,
device=device,
)

print(generated_text)
