
LLM Fine Tune

This document provides a comprehensive cheat sheet for generative AI engineering, specifically focusing on fine-tuning transformers using various methods and packages in PyTorch. It includes code examples for implementing positional encoding, importing the IMDB dataset, creating iterators, and building vocabulary objects from pretrained GloVe embeddings. Additionally, it covers training and prediction functions, as well as fine-tuning models on datasets like AG News and IMDB for text classification tasks.



Cheat Sheet: Generative AI Engineering and Fine-Tuning Transformers


Each entry below lists the package or method, a short description, and a code example.

Positional encoding
Description: Pivotal in transformers and sequence-to-sequence models, positional encoding conveys critical information about the position or ordering of elements within a sequence.
Code example:
class PositionalEncoding(nn.Module):
    """
    https://pytorch.org/tutorials/beginner/transformer_tutorial.html
    """
    def __init__(self, d_model, vocab_size=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(vocab_size, d_model)
        position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1), :]
        return self.dropout(x)

Importing the IMDB data set
Description: The IMDB data set contains movie reviews from the Internet Movie Database (IMDB) and is commonly used for binary sentiment classification tasks. It is a popular data set for training and testing models in natural language processing (NLP), particularly for sentiment analysis.
Code example:
urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storag
tar = tarfile.open(fileobj=io.BytesIO(urlopened.read()))
tempdir = tempfile.TemporaryDirectory()
tar.extractall(tempdir.name)
tar.close()

IMDBDataset class to create iterators for the train and test data sets
Description: Creates iterators for the training and testing data sets; this involves steps such as data loading, preprocessing, and iterator creation.
Code example:
root_dir = tempdir.name + '/' + 'imdb_dataset'
train_iter = IMDBDataset(root_dir=root_dir, train=True)   # For training data
test_iter = IMDBDataset(root_dir=root_dir, train=False)   # For test data
start = train_iter.pos_inx
for i in range(-10, 10):
    print(train_iter[start + i])

GloVe embeddings
Description: GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The model is trained on aggregated global word-to-word co-occurrence statistics from a corpus, and the resulting representations show linear substructures of the word vector space.
Code example:
class GloVe_override(Vectors):
    url = {
        "6B": "https://cf-courses-data.s3.us.cloud-object-storage.appd
    }
    def __init__(self, name="6B", dim=100, **kwargs) -> None:
        url = self.url[name]
        name = "glove.{}.{}d.txt".format(name, str(dim))
        #name = "glove.{}/glove.{}.{}d.txt".format(name, name, str(dim))
        super(GloVe_override, self).__init__(name, url=url, **kwargs)

class GloVe_override2(Vectors):
    url = {
        "6B": "https://cf-courses-data.s3.us.cloud-object-storage.appd
    }
    def __init__(self, name="6B", dim=100, **kwargs) -> None:
        url = self.url[name]
        #name = "glove.{}.{}d.txt".format(name, str(dim))
        name = "glove.{}/glove.{}.{}d.txt".format(name, name, str(dim))
        super(GloVe_override2, self).__init__(name, url=url, **kwargs)

try:
    glove_embedding = GloVe_override(name="6B", dim=100)
except:
    try:
        glove_embedding = GloVe_override2(name="6B", dim=100)
    except:
        glove_embedding = GloVe(name="6B", dim=100)

Building a vocabulary object from the pretrained GloVe word embedding model
Description: Involves the steps for creating a structured representation of words and their corresponding vector embeddings.
Code example:
from torchtext.vocab import GloVe, vocab

# Build vocab from glove_vectors
vocab = vocab(glove_embedding.stoi, 0, specials=('<unk>', '<pad>'))
vocab.set_default_index(vocab["<unk>"])

Convert the training and testing iterators to map-style data sets
Description: The training data set will contain 95% of the samples in the original training set, while the validation data set will contain the remaining 5%. These data sets can be used for training and evaluating a machine-learning model for text classification on the IMDB data set. The final performance of the model is evaluated on the hold-out test set.
Code example:
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)


CUDA-compatible GPU device
Description: Selects the compute device using PyTorch, a popular deep-learning framework. If a GPU is available, the device variable is set to "cuda" (CUDA is the parallel computing platform and application programming interface model developed by NVIDIA); otherwise, it is set to "cpu" and the code runs on the CPU.
Code example:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

collate_fn
Description: A collate_batch function is used with PyTorch data loaders to customize how batches are created from individual samples. It processes a batch of data, including labels and text sequences, applies the text_pipeline function to preprocess the text, converts the processed data into PyTorch tensors, and returns a tuple containing the label tensor and the padded text tensor. The function also moves the returned tensors to the specified device (GPU) for efficient computation.
Code example:
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(_label)
        text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, batch_first=True)
    return label_list.to(device), text_list.to(device)

Convert the data set objects to data loaders
Description: Used in PyTorch-based projects; includes creating data set objects, specifying data-loading parameters, and converting the data sets into data loaders.
Code example:
BATCH_SIZE = 32
train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

Predict function
Description: The predict function takes a text, a text pipeline, and a model as inputs. It uses the pretrained model passed as a parameter to predict the label of the text for text classification on the IMDB data set.
Code example:
def predict(text, text_pipeline, model):
    with torch.no_grad():
        text = torch.unsqueeze(torch.tensor(text_pipeline(text)), 0).to(device)
        model.to(device)
        output = model(text)
        return imdb_label[output.argmax(1).item()]
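A short usage sketch of the helper above. The imdb_label mapping, the text_pipeline, and the model_ classifier are assumptions based on names used elsewhere in this cheat sheet, not values confirmed here:

imdb_label = {0: "negative", 1: "positive"}   # assumed label mapping consumed by predict()
print(predict("This movie was a delightful surprise from start to finish.", text_pipeline, model_))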

Training function
Description: Trains the model by iteratively updating its parameters to minimize the loss function, improving the model's performance on a given task.
Code example:
def train_model(model, optimizer, criterion, train_dataloader, valid_dataloader, epochs, save_dir, file_name):
    cum_loss_list = []
    acc_epoch = []
    acc_old = 0
    model_path = os.path.join(save_dir, file_name)
    acc_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + "_acc.pkl")
    loss_dir = os.path.join(save_dir, os.path.splitext(file_name)[0] + "_loss.pkl")
    time_start = time.time()
    for epoch in tqdm(range(1, epochs + 1)):
        model.train()
        #print(model)
        #for parm in model.parameters():
        #    print(parm.requires_grad)
        cum_loss = 0
        for idx, (label, text) in enumerate(train_dataloader):
            optimizer.zero_grad()
            label, text = label.to(device), text.to(device)
            predicted_label = model(text)
            loss = criterion(predicted_label, label)
            loss.backward()
            #print(loss)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
            optimizer.step()
            cum_loss += loss.item()
        print(f"Epoch {epoch}/{epochs} - Loss: {cum_loss}")
        cum_loss_list.append(cum_loss)
        accu_val = evaluate_no_tqdm(valid_dataloader, model)
        acc_epoch.append(accu_val)
        if model_path and accu_val > acc_old:
            print(accu_val)
            acc_old = accu_val
            if save_dir is not None:
                pass
                #print("save model epoch", epoch)
                #torch.save(model.state_dict(), model_path)
                #save_list_to_file(lst=acc_epoch, filename=acc_dir)
                #save_list_to_file(lst=cum_loss_list, filename=loss_dir)
    time_end = time.time()
    print(f"Training time: {time_end - time_start}")

Fine-tune a model on the AG News data set
Description: Fine-tuning a model on the AG News data set means categorizing news articles into one of four categories: World, Sports, Business, or Sci/Tech. The example starts training a model from scratch on AG News. To train the model for 2 epochs on a smaller data set and see what the training process looks like, uncomment the part marked ### Uncomment to Train ### before running the cell; training for 2 epochs on the reduced data set takes approximately 3 minutes.
Code example:
train_iter_ag_news = AG_NEWS(split="train")
num_class_ag_news = len(set([label for (label, text) in train_iter_ag_news]))
num_class_ag_news
# Split the dataset into training and testing iterators.
train_iter_ag_news, test_iter_ag_news = AG_NEWS()
# Convert the training and testing iterators to map-style datasets.
train_dataset_ag_news = to_map_style_dataset(train_iter_ag_news)
test_dataset_ag_news = to_map_style_dataset(test_iter_ag_news)
# Determine the number of samples to be used for training and validation.
num_train_ag_news = int(len(train_dataset_ag_news) * 0.95)
# Randomly split the training dataset into training and validation datasets.
# The training dataset will contain 95% of the samples, the validation set the rest.
split_train_ag_news_, split_valid_ag_news_ = random_split(train_dataset_ag_news, [num_train_ag_news, len(train_dataset_ag_news) - num_train_ag_news])
# Make the training set smaller to allow it to run fast as an example.
# IF YOU WANT TO TRAIN ON THE AG_NEWS DATASET, COMMENT OUT THE 2 LINES BELOW.
# HOWEVER, NOTE THAT TRAINING WILL TAKE A LONG TIME
num_train_ag_news = int(len(train_dataset_ag_news) * 0.05)
split_train_ag_news_, _ = random_split(split_train_ag_news_, [num_train_ag_news, len(split_train_ag_news_) - num_train_ag_news])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

def label_pipeline(x):
    return int(x) - 1

from torch.nn.utils.rnn import pad_sequence

def collate_batch_ag_news(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, batch_first=True)
    return label_list.to(device), text_list.to(device)

BATCH_SIZE = 32
train_dataloader_ag_news = DataLoader(
    split_train_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news
)
valid_dataloader_ag_news = DataLoader(
    split_valid_ag_news_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news
)
test_dataloader_ag_news = DataLoader(
    test_dataset_ag_news, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch_ag_news
)
model_ag_news = Net(num_class=4, vocab_size=vocab_size).to(device)
model_ag_news.to(device)
'''
### Uncomment to Train ###
LR=1
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_ag_news.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
save_dir = ""
file_name = "model_AG News small1.pth"
train_model(model=model_ag_news, optimizer=optimizer, criterion=criterion, train_dataloader=train_dataloader_ag_news, valid_dataloader=valid_dataloader_ag_news, epochs=2, save_dir=save_dir, file_name=file_name)

Cost and validation data accuracy for each epoch
Description: Plots the cost and validation-data accuracy for each epoch of the pretrained model, up to and including the epoch that yielded the highest accuracy. The pretrained model achieved an accuracy of over 90% on the AG News validation set.
Code example:
acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-st
loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-s
acc_epoch = pickle.load(acc_urlopened)
cum_loss_list = pickle.load(loss_urlopened)
plot(cum_loss_list, acc_epoch)

Fine-tune the final layer
Description: Fine-tuning the final output layer of a neural network is similar to fine-tuning the whole model. Begin by loading the pretrained model you would like to fine-tune; in this case, the same model pretrained on the AG News data set.
Code example:
urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storag
model_fine2 = Net(vocab_size=vocab_size, num_class=4).to(device)
model_fine2.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))

Fine-tune on the full IMDB training set for 100 epochs
Description: Helps achieve a well-optimized model that accurately classifies movie reviews as positive or negative. The snippet loads and plots the saved training loss and validation accuracy.
Code example:
acc_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-st
loss_urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-s
acc_epoch = pickle.load(acc_urlopened)
cum_loss_list = pickle.load(loss_urlopened)
plot(cum_loss_list, acc_epoch)


Adapter model
Description: FeatureAdapter is a neural network module that introduces a low-dimensional bottleneck into a transformer architecture so that fine-tuning requires fewer parameters. It compresses the original high-dimensional embeddings into a lower dimension, applies a nonlinear transformation, and then expands them back to the original dimension. This is followed by a residual connection that adds the transformed output back to the original input to preserve information and promote gradient flow.
Code example:
class FeatureAdapter(nn.Module):
    """
    Attributes:
        size (int): The bottleneck dimension to which the embeddings are reduced.
        model_dim (int): The original dimension of the embeddings or features.
    """
    def __init__(self, bottleneck_size=50, model_dim=100):
        super().__init__()
        self.bottleneck_transform = nn.Sequential(
            nn.Linear(model_dim, bottleneck_size),  # Down-project to the bottleneck size
            nn.ReLU(),                              # Apply non-linearity
            nn.Linear(bottleneck_size, model_dim)   # Up-project back to the model dimension
        )

    def forward(self, x):
        """
        Forward pass of the FeatureAdapter. Applies the bottleneck transform to the input
        tensor and adds a skip connection.
        Args:
            x (Tensor): Input tensor with shape (batch_size, seq_length, model_dim).
        Returns:
            Tensor: Output tensor after applying the adapter transformation,
            maintaining the original input shape.
        """
        transformed_features = self.bottleneck_transform(x)  # Transform features through the bottleneck
        output_with_residual = transformed_features + x      # Add the residual (skip) connection
        return output_with_residual
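A quick shape check for the adapter above; the tensor sizes here are arbitrary assumptions, chosen only to show that the residual design preserves the input shape:

adapter = FeatureAdapter(bottleneck_size=50, model_dim=100)
x = torch.randn(8, 20, 100)   # (batch_size, seq_length, model_dim), assumed sizes
print(adapter(x).shape)       # torch.Size([8, 20, 100]) - output shape matches the input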

Traverse the IMDB data set
Description: Traverses the IMDB data set by obtaining, loading, and exploring it. It also performs basic operations, visualizes the data, and analyzes and interprets the data set.
Code example:
class IMDBDataset(Dataset):
    def __init__(self, root_dir, train=True):
        """
        root_dir: The base directory of the IMDB dataset.
        train: A boolean flag indicating whether to use training or test data.
        """
        self.root_dir = os.path.join(root_dir, "train" if train else "test")
        self.neg_files = [os.path.join(self.root_dir, "neg", f) for f in os.listdir(os.path.join(self.root_dir, "neg"))]
        self.pos_files = [os.path.join(self.root_dir, "pos", f) for f in os.listdir(os.path.join(self.root_dir, "pos"))]
        self.files = self.neg_files + self.pos_files
        self.labels = [0] * len(self.neg_files) + [1] * len(self.pos_files)
        self.pos_inx = len(self.pos_files)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        file_path = self.files[idx]
        label = self.labels[idx]
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        return label, content

Iterators for the train and test data sets
Description: Builds the path to the IMDB data set directory by combining the temporary directory and subdirectory names, sets up the training and testing data iterators, retrieves the starting index of the positive samples, and prints items from the training data set around that index.
Code example:
root_dir = tempdir.name + '/' + 'imdb_dataset'
train_iter = IMDBDataset(root_dir=root_dir, train=True)   # For training data
test_iter = IMDBDataset(root_dir=root_dir, train=False)   # For test data
start = train_iter.pos_inx
for i in range(-10, 10):
    print(train_iter[start + i])

yield_tokens function
Description: Generates tokens from a collection of text data samples. The code processes each text in data_iter through the tokenizer and yields tokens on the fly, which is efficient for tasks such as training machine learning models.
Code example:
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    """Yield tokens for each data sample."""
    for _, text in data_iter:
        yield tokenizer(text)
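If you are not building the vocabulary from GloVe as shown earlier, a vocabulary can also be built directly from these tokens. A minimal sketch using torchtext's build_vocab_from_iterator; the choice of special tokens is an assumption:

from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])   # unknown words map to <unk>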

Load a pretrained model and evaluate it on test data
Description: Downloads a pretrained model from a URL, loads it into a specific architecture, and evaluates it on a test data set to assess its performance.
Code example:
urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storag
model_ = Net(vocab_size=vocab_size, num_class=2).to(device)
model_.load_state_dict(torch.load(io.BytesIO(urlopened.read()), map_location=device))
evaluate(test_dataloader, model_)

Loading the Hugging Face model
Description: Instantiates a tokenizer from the pretrained 'bert-base-cased' model, downloads a pretrained model for the masked language modeling (MLM) task, and shows how to load the model configuration instead in order to start training from scratch.
Code example:
# Instantiate a tokenizer using the BERT base cased model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Download pretrained model from huggingface.co and cache.
model = BertForMaskedLM.from_pretrained('bert-base-cased')
# You can also start training from scratch by loading the model configuration:
# config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
# model = BertForMaskedLM.from_config(config)

Training a BERT model for the MLM task
Description: Trains the model with the specified parameters and data set. Ensure that 'SFTTrainer' is the appropriate trainer class for the task and that the model is properly defined for training.
Code example:
training_args = TrainingArguments(
    output_dir="./trained_model",   # Specify the output directory
    overwrite_output_dir=True,
    do_eval=False,
    learning_rate=5e-5,
    num_train_epochs=1,             # Specify the number of training epochs
    per_device_train_batch_size=2,  # Set the batch size for training
    save_total_limit=2,             # Limit the total number of saved checkpoints
    logging_steps=20
)
dataset = load_dataset("imdb", split="train")
trainer = SFTTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
)
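For a pure masked-language-modeling objective, a common alternative is the standard Trainer combined with DataCollatorForLanguageModeling, which randomly masks tokens for the model to predict. This is a sketch, assuming the BertForMaskedLM model and tokenizer loaded above and a tokenized data set named tokenized_dataset (an assumption, not defined in this cheat sheet):

from transformers import Trainer, DataCollatorForLanguageModeling

# Randomly masks 15% of the tokens in each batch for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,                      # BertForMaskedLM loaded earlier
    args=training_args,
    train_dataset=tokenized_dataset,  # assumed: the dataset already tokenized with this tokenizer
    data_collator=data_collator,
)
trainer.train()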

Load the model and tokenizer
Description: Useful for tasks where you need to quickly classify the sentiment of a piece of text with a pretrained, efficient transformer model.
Code example:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncas
model = DistilBertForSequenceClassification.from_pretrained("distilber

torch.no_grad()
Description: The torch.no_grad() context manager disables gradient calculation. This reduces memory consumption and speeds up computation, as gradients are unnecessary for inference (that is, when you are not training the model). The **inputs syntax unpacks a dictionary of keyword arguments in Python.
Code example:
# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
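A minimal end-to-end sketch of this inference pattern, assuming the DistilBERT tokenizer and sequence-classification model loaded above; the example sentence is arbitrary:

inputs = tokenizer("The plot was thin, but the acting was superb.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predicted_class_id = outputs.logits.argmax(dim=-1).item()   # index of the highest-scoring class
print(model.config.id2label[predicted_class_id])            # map the index to its text label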

GPT-2 tokenizer
Description: Initializes the GPT-2 tokenizer from the pretrained model to handle encoding and decoding.
Code example:
# Load the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Load the GPT-2 model
Description: Initializes and loads the pretrained GPT-2 model, making it ready for generating text and other language tasks.
Code example:
# Load the tokenizer and model
model = GPT2LMHeadModel.from_pretrained("gpt2")
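The generation snippet below expects an inputs object; a minimal sketch of how it might be produced with the tokenizer above (the prompt text is an arbitrary example):

# Encode a prompt into input_ids and an attention_mask for generation
inputs = tokenizer("Once upon a time", return_tensors="pt")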

Generate text
Description: Generates text sequences from the input without computing gradients.
Code example:
# Generate text
output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_length=50,
    num_return_sequences=1
)
output_ids

# or
with torch.no_grad():
    outputs = model(**inputs)
outputs

Decode the generated text
Description: Decodes the token IDs generated by the model into a readable string and prints it.
Code example:
# Decode the generated text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

Hugging Face pipeline() function
Description: The pipeline() function from the Hugging Face transformers library is a high-level API designed to simplify the usage of pretrained models for various natural language processing (NLP) tasks. It abstracts the complexities of model loading, tokenization, inference, and post-processing, allowing users to perform complex NLP tasks with just a few lines of code.
Code example:
transformers.pipeline(
    task: str,
    model: Optional = None,
    config: Optional = None,
    tokenizer: Optional = None,
    feature_extractor: Optional = None,
    framework: Optional = None,
    revision: str = 'main',
    use_fast: bool = True,
    model_kwargs: Dict[str, Any] = None,
    **kwargs
)
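A short usage sketch of the pipeline() API; the model is left to the library's default for the task, and the input sentence is arbitrary:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("This cheat sheet saves me a lot of time.")
print(result)   # a list of dicts, each with a 'label' and a 'score'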

formatting_prompts_func_no_response function
Description: The prompt functions generate formatted text prompts from a data set using its instructions. formatting_prompts_func creates strings containing both the instruction and the response, while formatting_prompts_func_no_response creates strings that include only the instruction and a placeholder for the response.
Code example:
def formatting_prompts_func(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Instruction:\n{mydataset['instruction'][i]}"
            f"\n\n### Response:\n{mydataset['output'][i]}"
        )
        output_texts.append(text)
    return output_texts

def formatting_prompts_func_no_response(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Instruction:\n{mydataset['instruction'][i]}"
            f"\n\n### Response:\n"
        )
        output_texts.append(text)
    return output_texts

expected_outputs
Description: Tokenizes the instructions and the instructions_with_responses. It then counts the number of tokens in each instruction and discards the equivalent number of tokens from the beginning of the tokenized instructions_with_responses vector, and finally discards the last token, which corresponds to the EOS token. The resulting vector is decoded with the tokenizer to produce the expected_output.
Code example:
expected_outputs = []
instructions_with_responses = formatting_prompts_func(test_dataset)
instructions = formatting_prompts_func_no_response(test_dataset)
for i in tqdm(range(len(instructions_with_responses))):
    tokenized_instruction_with_response = tokenizer(instructions_with_responses[i], return_tensors="pt")
    tokenized_instruction = tokenizer(instructions[i], return_tensors="pt")
    expected_output = tokenizer.decode(tokenized_instruction_with_response["input_ids"][0][len(tokenized_instruction["input_ids"][0]) : -1])
    expected_outputs.append(expected_output)

ListDataset
Description: Inherits from Dataset to create a torch Dataset from a list. This class is then used to generate a Dataset object from the instructions.
Code example:
class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)

    def __getitem__(self, i):
        return self.original_list[i]

instructions_torch = ListDataset(instructions)

gen_pipeline
Description: Builds a text-generation pipeline from the model and tokenizer; the pipeline decodes the token IDs produced by the model and returns the generated responses.
Code example:
gen_pipeline = pipeline("text-generation",
                        model=model,
                        tokenizer=tokenizer,
                        device=device,
                        batch_size=2,
                        max_length=50,
                        truncation=True,
                        padding=False,
                        return_full_text=False)

torch.no_grad()
Description: Generates text from the given instructions using the pipeline while limiting resource usage by restricting the input size and disabling gradient calculations.
Code example:
with torch.no_grad():
    # Due to resource limitation, only apply the function on 3 records
    pipeline_iterator = gen_pipeline(instructions_torch[:3],
                                     max_length=50,   # this is set to 50
                                     num_beams=5,
                                     early_stopping=True,)
generated_outputs_base = []
for text in pipeline_iterator:
    generated_outputs_base.append(text[0]["generated_text"])

SFTTrainer
Description: Sets up a training configuration with 'SFTConfig' by specifying its parameters, and initializes the 'SFTTrainer' with the model, data sets, formatting function, and additional settings.
Code example:
training_args = SFTConfig(
    output_dir="/tmp",
    num_train_epochs=10,
    save_strategy="epoch",
    fp16=True,
    per_device_train_batch_size=2,  # Reduce batch size
    per_device_eval_batch_size=2,   # Reduce batch size
    max_seq_length=1024,
    do_eval=True
)
trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    packing=False,
    data_collator=collator,
)

torch.no_grad()
Description: Generates text sequences with the pipeline function while gradient computation is disabled, optimizing performance and memory usage.
Code example:
with torch.no_grad():
    # Due to resource limitation, only apply the function on 3 records
    pipeline_iterator = gen_pipeline(instructions_torch[:3],
                                     max_length=50,   # this is set to 50 due to resource limits
                                     num_beams=5,
                                     early_stopping=True,)
generated_outputs_lora = []
for text in pipeline_iterator:
    generated_outputs_lora.append(text[0]["generated_text"])


load_summarize_chain
Description: Uses the LangChain library to load and run a summarization chain with a specific language model and chain type. The chain is applied to web data and the resulting summary is printed.
Code example:
from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm=mixtral_llm, chain_type="stuff", verbose=True)
response = chain.invoke(web_data)
print(response['output_text'])

TextClassifier
Description: A simple text classifier that uses an embedding layer, a hidden linear layer with a ReLU activation, and an output linear layer. The constructor takes the following arguments: num_classes, the number of classes to classify, and freeze, whether to freeze the embedding layer.
Code example:
from torch import nn

class TextClassifier(nn.Module):
    def __init__(self, num_classes, freeze=False):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(glove_embedding.vectors, freeze=freeze)
        # An example of adding additional layers: a linear layer and a ReLU activation
        self.fc1 = nn.Linear(in_features=100, out_features=128)
        self.relu = nn.ReLU()
        # The output layer that gives the final scores for the classes
        self.fc2 = nn.Linear(in_features=128, out_features=num_classes)

    def forward(self, x):
        # Pass the input through the embedding layer
        x = self.embedding(x)
        # Here you can use a simple mean pooling
        x = torch.mean(x, dim=1)
        # Pass the pooled embeddings through the additional layers
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)

Train the model
Description: Outlines a function to train a machine learning model using PyTorch. The function trains the model over a specified number of epochs, tracks the cumulative loss, and evaluates performance on the validation data set, keeping track of the best accuracy.
Code example:
def train_model(model, optimizer, criterion, train_dataloader, valid_dataloader, epochs, model_name):
    cum_loss_list = []
    acc_epoch = []
    best_acc = 0
    file_name = model_name
    for epoch in tqdm(range(1, epochs + 1)):
        model.train()
        cum_loss = 0
        for _, (label, text) in enumerate(train_dataloader):
            optimizer.zero_grad()
            predicted_label = model(text)
            loss = criterion(predicted_label, label)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
            optimizer.step()
            cum_loss += loss.item()
        #print("Loss:", cum_loss)
        cum_loss_list.append(cum_loss)
        acc_val = evaluate(valid_dataloader, model, device)
        acc_epoch.append(acc_val)
        if acc_val > best_acc:
            best_acc = acc_val
            print(f"New best accuracy: {acc_val:.4f}")
            #torch.save(model.state_dict(), f"{model_name}.pth")
            #save_list_to_file(cum_loss_list, f"{model_name}_loss.pkl")
            #save_list_to_file(acc_epoch, f"{model_name}_acc.pkl")

plot_matrix_and_subspace(F)
Description: Useful for visualizing the column vectors of a matrix and the subspace they span in 3D space.
Code example:
def plot_matrix_and_subspace(F):
    assert F.shape[0] == 3, "Matrix F must have rows equal to 3 for 3D plotting"
    ax = plt.figure().add_subplot(projection='3d')
    # Plot each column vector of F as a line from the origin
    for i in range(F.shape[1]):
        ax.quiver(0, 0, 0, F[0, i], F[1, i], F[2, i], color='blue', arrow_length_ratio=0.1)
    if F.shape[1] == 2:
        # Calculate the normal to the plane spanned by the columns of F
        normal_vector = np.cross(F[:, 0], F[:, 1])
        # Plot the plane
        xx, yy = np.meshgrid(np.linspace(-3, 3, 10), np.linspace(-3, 3, 10))
        zz = (-normal_vector[0] * xx - normal_vector[1] * yy) / normal_vector[2]
        ax.plot_surface(xx, yy, zz, alpha=0.5, color='green', label='Spanned subspace')
    # Set plot limits and labels
    ax.set_xlim([-3, 3])
    ax.set_ylim([-3, 3])
    ax.set_zlim([-3, 3])
    ax.set_xlabel('$x_{1}$')
    ax.set_ylabel('$x_{2}$')
    ax.set_zlabel('$x_{3}$')
    #ax.legend()
    plt.show()

nn.Parameter (LoRALayer)
Description: Defines the trainable parameters of the 'LoRALayer' module with 'nn.Parameter'. The 'LoRALayer' is used as an intermediate layer in a simple neural network.
Code example:
class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x

LinearWithLoRA class
Description: Defines a custom layer that wraps an existing linear layer and adds a parallel 'LoRALayer', so the LoRA matrices created with 'nn.Parameter' are the only new learnable parameters introduced for fine-tuning.
Code example:
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear.to(device)
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        ).to(device)

    def forward(self, x):
        return self.linear(x) + self.lora(x)
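A small usage sketch, mirroring how this cheat sheet later swaps a classifier's fc1 layer for its LoRA-wrapped version; the rank, alpha, and tensor sizes here are arbitrary:

base_linear = torch.nn.Linear(100, 128)                      # an existing linear layer
lora_linear = LinearWithLoRA(base_linear, rank=4, alpha=1.0)
out = lora_linear(torch.randn(8, 100).to(device))            # output shape matches the wrapped layer
print(out.shape)                                             # torch.Size([8, 128])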

Applying LoRA
Description: To fine-tune with LoRA, first load a pretrained TextClassifier model, load its pretrained state from a file, and then disable gradient updates for all its parameters to prevent further training. The model loaded here was pretrained on the AG News data set, which has 4 classes, so num_classes is set to 4. Moreover, the pretrained AG News model was trained with the embedding layer unfrozen, so the model is initialized with freeze=False. Although the model is initialized with layers unfrozen and with the wrong number of classes for the target task, later modifications to the model correct this.
Code example:
from urllib.request import urlopen
import io

model_lora = TextClassifier(num_classes=4, freeze=False)
model_lora.to(device)
urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storag
stream = io.BytesIO(urlopened.read())
state_dict = torch.load(stream, map_location=device)
model_lora.load_state_dict(state_dict)
# Here, you freeze all layers:
for parm in model_lora.parameters():
    parm.requires_grad = False
model_lora

Select rank and alpha
Description: Evaluates the performance of a text classification model under varying 'LoRALayer' configurations. For each combination of the rank and alpha hyperparameters, it loads the pretrained model, applies LoRA, trains the model, and records the accuracy of the configuration.
Code example:
ranks = [1, 2, 5, 10]
alphas = [0.1, 0.5, 1.0, 2.0, 5.0]
results = []
accuracy_old = 0
# Loop over each combination of 'r' and 'alpha'
for r in ranks:
    for alpha in alphas:
        print(f"Testing with rank = {r} and alpha = {alpha}")
        model_name = f"model_lora_rank{r}_alpha{alpha}_AGtoIBDM_final_ad
        model_lora = TextClassifier(num_classes=4, freeze=False)
        model_lora.to(device)
        urlopened = urlopen('https://cf-courses-data.s3.us.cloud-objec
        stream = io.BytesIO(urlopened.read())
        state_dict = torch.load(stream, map_location=device)
        model_lora.load_state_dict(state_dict)
        for parm in model_lora.parameters():
            parm.requires_grad = False
        model_lora.fc2 = nn.Linear(in_features=128, out_features=2, bias=True)
        model_lora.fc1 = LinearWithLoRA(model_lora.fc1, rank=r, alpha=alpha)
        optimizer = torch.optim.Adam(model_lora.parameters(), lr=LR)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer,
        model_lora.to(device)
        train_model(model_lora, optimizer, criterion, train_dataloader
        accuracy = evaluate(valid_dataloader, model_lora, device)
        result = {
            'rank': r,
            'alpha': alpha,
            'accuracy': accuracy
        }
        # Append the dictionary to the results list
        results.append(result)
        if accuracy > accuracy_old:
            print(f"Testing with rank = {r} and alpha = {alpha}")
            print(f"accuracy: {accuracy} accuracy_old: {accuracy_old}")
            accuracy_old = accuracy
            torch.save(model.state_dict(), f"{model_name}.pth")
            save_list_to_file(cum_loss_list, f"{model_name}_loss.pkl")
            save_list_to_file(acc_epoch, f"{model_name}_acc.pkl")

model_lora model
Description: Sets up the training components for the model: a learning rate of 1, cross-entropy loss as the criterion, stochastic gradient descent (SGD) as the optimizer, and a scheduler that decays the learning rate by a factor of 0.1 at each epoch.
Code example:
LR = 1
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_lora.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

load_dataset
Description: Loads the data set with the load_dataset function from the datasets library, specifically the "train" split; prints a few samples, renames the "text" column to "review", and filters out short reviews.
Code example:
dataset_name = "imdb"
ds = load_dataset(dataset_name, split="train")
N = 5
for sample in range(N):
    print('text', ds[sample]['text'])
    print('label', ds[sample]['label'])
ds = ds.rename_columns({"text": "review"})
ds
ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)


build_dataset
Description: Incorporates the necessary steps to build a data set object for use as an input to PPOTrainer.
Code example:
del(ds)
dataset_name = "imdb"
ds = load_dataset(dataset_name, split="train")
ds = ds.rename_columns({"text": "review"})

def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
    """
    Build dataset for training. This builds the dataset from `load_dataset`; one should
    customize this function to train the model on its own dataset.
    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.
    Returns:
        dataloader (`torch.utils.data.DataLoader`):
            The dataloader for the dataset.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
    input_size = LengthSampler(input_min_text_length, input_max_text_length)
    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample
    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

Text generation function
Description: Tokenizes the input text, generates a response with the model, and decodes the generated token IDs back into text.
Code example:
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

def generate_some_text(input_text, my_model):
    # Tokenize the input text
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)
    generated_ids = my_model.generate(input_ids, **gen_kwargs)
    # Decode the generated text
    generated_text_ = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return generated_text_

Tokenizing data
Description: Defines a tokenize_function that tokenizes text with the BERT tokenizer, applying padding so all sequences have the same length and truncation to limit the maximum sequence length, and then applies it to the data set in batches.
Code example:
# Instantiate a tokenizer using the BERT base cased model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Define a function to tokenize examples
def tokenize_function(examples):
    # Tokenize the text using the tokenizer
    # Apply padding to ensure all sequences have the same length
    # Apply truncation to limit the maximum sequence length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the tokenize function to the dataset in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Training loop
Description: The train_model function trains a model using training data provided through a dataloader. It sets up a progress bar to monitor training, switches the model to training mode, and, for each batch in each epoch, moves the data to the correct device, runs a forward pass to compute the loss, backpropagates the gradients, updates the model parameters, steps the learning rate scheduler, and clears the gradients. Finally, it plots the average training loss per epoch.
Code example:
def train_model(model, tr_dataloader):
    # Create a progress bar to track the training progress
    progress_bar = tqdm(range(num_training_steps))
    # Set the model in training mode
    model.train()
    tr_losses = []
    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0
        # Iterate over the training data batches
        for batch in tr_dataloader:
            # Move the batch to the appropriate device
            batch = {k: v.to(device) for k, v in batch.items()}
            # Forward pass through the model
            outputs = model(**batch)
            # Compute the loss
            loss = outputs.loss
            # Backward pass (compute gradients)
            loss.backward()
            total_loss += loss.item()
            # Update the model parameters
            optimizer.step()
            # Update the learning rate scheduler
            lr_scheduler.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Update the progress bar
            progress_bar.update(1)
        tr_losses.append(total_loss / len(tr_dataloader))
    # Plot the loss
    plt.plot(tr_losses)
    plt.title("Training loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

evaluate_model function
Description: Works like the train_model function but evaluates the model's performance instead of training it. It processes data in batches from a dataloader, sets the model to evaluation mode, disables gradient calculation since no training is involved, computes predictions for each batch, updates an accuracy metric, and prints the overall accuracy after all batches are processed.
Code example:
def evaluate_model(model, evl_dataloader):
    # Create an instance of the Accuracy metric for multiclass classification
    metric = Accuracy(task="multiclass", num_classes=5).to(device)
    # Set the model in evaluation mode
    model.eval()
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the evaluation data batches
        for batch in evl_dataloader:
            # Move the batch to the appropriate device
            batch = {k: v.to(device) for k, v in batch.items()}
            # Forward pass through the model
            outputs = model(**batch)
            # Get the predicted class labels
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            # Accumulate the predictions and labels for the metric
            metric(predictions, batch["labels"])
    # Compute the accuracy
    accuracy = metric.compute()
    # Print the accuracy
    print("Accuracy:", accuracy.item())

llm_model
Description: Defines the function 'llm_model' for generating text with the 'mixtral-8x7b-instruct-v01' model from Mistral AI. The function allows customizing the generation parameters and interacts with IBM Watson machine learning services.
Code example:
def llm_model(prompt_txt, params=None):
    model_id = 'mistralai/mixtral-8x7b-instruct-v01'
    default_params = {
        "max_new_tokens": 256,
        "min_new_tokens": 0,
        "temperature": 0.5,
        "top_p": 0.2,
        "top_k": 1
    }
    if params:
        default_params.update(params)
    parameters = {
        GenParams.MAX_NEW_TOKENS: default_params["max_new_tokens"],
        GenParams.MIN_NEW_TOKENS: default_params["min_new_tokens"],
        GenParams.TEMPERATURE: default_params["temperature"],  # this regulates the randomness of the model's responses
        GenParams.TOP_P: default_params["top_p"],
        GenParams.TOP_K: default_params["top_k"]
    }
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com"
    }
    project_id = "skills-network"
    model = Model(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )
    mixtral_llm = WatsonxLLM(model=model)
    response = mixtral_llm.invoke(prompt_txt)
    return response

class_names
Description: Maps numerical labels to their corresponding text descriptions for classification tasks. This helps interpret model output, where the model's predictions are numerical and should be presented in a more human-readable format.
Code example:
class_names = {0: "negative", 1: "positive"}
class_names

DistilBERT tokenizer
Description: Uses 'AutoTokenizer' to preprocess text data for DistilBERT, a lighter version of BERT. It tokenizes input text into a format suitable for the model by converting words into token IDs, handling special tokens, padding, and truncating sequences as needed.
Code example:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Tokenize input IDs
Description: Tokenizes text data and inspects the resulting token IDs, attention masks, and token type IDs for further processing in natural language processing (NLP) tasks.
Code example:
my_tokens = tokenizer(imdb['train'][0]['text'])
# Print the tokenized input IDs
print("Input IDs:", my_tokens['input_ids'])
# Print the attention mask
print("Attention Mask:", my_tokens['attention_mask'])
# If token_type_ids is present, print it
if 'token_type_ids' in my_tokens:
    print("Token Type IDs:", my_tokens['token_type_ids'])

Preprocessing function tokenizer
Description: Shows how to use a tokenizer to preprocess text data from the IMDB data set. The tokenizer is applied to the training and test splits to convert the text into tokenized input IDs, an attention mask, and token type IDs.
Code example:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
small_tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
medium_tokenized_train = medium_train_dataset.map(preprocess_function, batched=True)
medium_tokenized_test = medium_test_dataset.map(preprocess_function, batched=True)

compute_metrics function
Description: Evaluates model performance using accuracy.
Code example:
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy", trust_remote_code=True)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)
    return {"accuracy": accuracy}
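A sketch of where this function plugs in, assuming a standard transformers Trainer and the tokenized splits created earlier in this cheat sheet:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_tokenized_train,   # assumed: tokenized training split from above
    eval_dataset=small_tokenized_test,     # assumed: tokenized test split from above
    compute_metrics=compute_metrics,       # called on (logits, labels) at evaluation time
)
trainer.train()
print(trainer.evaluate())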

Configure BitsAndBytes
Description: Defines the quantization parameters.
Code example:
config_bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the model to 4 bits when you load it
    bnb_4bit_quant_type="nf4",              # use a special 4-bit data type for the weights
    bnb_4bit_use_double_quant=True,         # nested quantization scheme to quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # use bfloat16 for faster computation
    llm_int8_skip_modules=["classifier", "pre_classifier"]  # don't convert these modules
)

id2label
Description: Maps IDs to text labels for the two classes in this problem.
Code example:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

label2id
Description: Swaps the keys and the values to map the text labels to the IDs.
Code example:
label2id = dict((v, k) for k, v in id2label.items())

model_qlora
Description: Creates a model called model_qlora for sequence classification using DistilBERT, configured with the id2label and label2id mappings, two output labels, and the quantization configuration from config_bnb.
Code example:
model_qlora = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    id2label=id2label,
    label2id=label2id,
    num_labels=2,
    quantization_config=config_bnb
)

training_args
Description: Initializes the training arguments for the model: the output directory for results, 10 training epochs, a learning rate of 2e-5, the batch sizes for training and evaluation, an evaluation strategy of once per epoch, and weight decay.
Code example:
training_args = TrainingArguments(
    output_dir="./results_qlora",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    weight_decay=0.01
)

text_to_emb
Description: Converts a list of text strings into their corresponding embeddings using a predefined tokenizer.
Code example:
def text_to_emb(list_of_text, max_input=512):
    data_token_index = tokenizer.batch_encode_plus(list_of_text, add_special_tokens=True, padding=True, truncation=True, max_length=max_input, return_tensors='pt')
    question_embeddings = aggregate_embeddings(data_token_index['input_ids'], data_token_index['attention_mask'])
    return question_embeddings
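aggregate_embeddings is not defined in this cheat sheet; a hypothetical mean-pooling implementation consistent with the call above could look like the following. The bert_model encoder it relies on is also an assumption:

def aggregate_embeddings(input_ids, attention_masks, model=bert_model):
    # Hypothetical helper: mean-pool the token embeddings of each sequence, ignoring padding
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_masks)
    token_embeddings = outputs.last_hidden_state       # (batch, seq_len, hidden)
    mask = attention_masks.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)      # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)           # number of real tokens per sequence
    return summed / counts                             # (batch, hidden) mean-pooled embeddings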

model_name_or_path
Description: Sets the model name to 'gpt2' and initializes the tokenizer and model for sequence classification with GPT-2. Special tokens are added for padding, and the maximum sequence length is set to 1024.
Code example:
# Define the model name or path
model_name_or_path = "gpt2"
# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = GPT2ForSequenceClassification.from_pretrained(model_name_or_path)
# Add special tokens if necessary
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
# Define the maximum length
max_length = 1024

