Model Fine Tuning Documentation
Introduction
Model Orchestration (MO for short) is the component where we fine-tune the model for the specific
activities of the knowledge work.
The knowledge work contains a list of activities, and for each activity there is a recommended
model. We can fine-tune the recommended model, or we can choose another model and fine-tune that instead.
Process
1. Model Selection
2. Upload Dataset
3. Hyperparameters
4. Inference
Explanation
Model Selection:
For each task of the knowledge work, we choose a model based on the activity. In the knowledge
work, the base model specifies what type of task it is: predictive, recognition, or generative.
Thus, for each task there is a model to select. A recommended model is already chosen at the
knowledge work step, but we can also choose a model on our own.
Concepts in Model Selection:
Here we have 3 types of models based on the work: Predictive, Recognition, and Generative.
Predictive model:
Involves using statistical algorithms and machine learning techniques to analyze historical data and
make predictions about the future or unknown events.
Once the model is trained, it can be used to make predictions on new data where the target variable is
unknown.
Recognition model:
This is a model which focuses on identifying the patterns, features or attributes within a given
dataset
• Image Recognition: Identifying objects, people, or scenes in images. For example, convolutional
neural networks (CNNs) are often used for tasks like facial recognition or identifying specific objects
in photos.
• Speech Recognition: Converting spoken language into text. This involves processing audio signals
and using models to recognize words and phrases.
Generative model:
Aims to learn the underlying patterns or attributes of data in order to generate new, similar data.
• Unlike discriminative models, which focus on distinguishing between different classes or
categories, generative models learn the underlying distribution of the data and can create new
instances from that distribution.
• Image Generation: Creating new images that resemble a training set, such as generating realistic
photographs. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are
popular techniques used for this purpose.
• Text Generation: Producing coherent and contextually relevant text, such as writing stories, articles,
or dialogue. Models like GPT (Generative Pre-trained Transformer) are examples of generative
models for text.
Upload Dataset:
Here, we have to upload the dataset for training the model, or more specifically for fine-tuning it with
task-specific data.
The basic step is to upload an SFT (Supervised Fine-Tuning) dataset for fine-tuning the model. After we
have trained it with the SFT method, we can fine-tune again using different techniques such as
DPO, KTO, or RM+PPO.
What are SFT and the other fine-tuning techniques such as DPO, KTO, and RM+PPO?
SFT
SFT (Supervised Fine-Tuning) is a fine-tuning technique where we take an already trained model,
such as GPT or Llama, and fine-tune it on our specific task so that the model aligns to that task and is
better at answering prompts related to it. SFT uses a paired dataset where each input is mapped to a
specific output: instead of a plain CSV of raw data, we provide curated input/output pairs to
fine-tune the model, so that it gives better responses and aligns more closely with our required
preferences.
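For illustration, here is a minimal Python sketch of what such a paired SFT dataset can look like and how it might be flattened into training text. The example pairs and the instruction/response template are hypothetical, not a fixed format:

# A few hand-curated input/output pairs (hypothetical examples).
sft_examples = [
    {"prompt": "Summarize: The invoice is overdue by 30 days.",
     "response": "The invoice is 30 days overdue."},
    {"prompt": "Classify the sentiment: I love this product.",
     "response": "Positive"},
]

def format_example(example):
    # Flatten one prompt/response pair into a single training string.
    return f"### Instruction:\n{example['prompt']}\n### Response:\n{example['response']}"

for ex in sft_examples:
    print(format_example(ex))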
DPO
First Understanding:
Direct Preference Optimization.
A collection of triplets that map a specific input to a desired and an undesired output.
It fine-tunes the model to generate responses that are more aligned with human preferences.
Format of the dataset:
Prompt: ""
Preferred response: ""
Unpreferred response: ""
Second Search.
Optimizes the model based on human preferences using direct feedback. Human preference
here means which output is better or more aligned with the user's goal.
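As a minimal sketch (assuming PyTorch, with made-up log-probabilities), the DPO loss for one triplet can be computed roughly like this; the policy is pushed to prefer the chosen response more strongly than a frozen reference model does:

import torch
import torch.nn.functional as F

beta = 0.1  # strength of the preference term (a commonly used default)

# Dummy sequence log-probabilities for one (prompt, preferred, unpreferred) triplet.
logp_chosen = torch.tensor(-12.0)      # policy log p(preferred | prompt)
logp_rejected = torch.tensor(-15.0)    # policy log p(unpreferred | prompt)
ref_logp_chosen = torch.tensor(-13.0)  # frozen reference model, same inputs
ref_logp_rejected = torch.tensor(-14.0)

margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
loss = -F.logsigmoid(beta * margin)  # DPO loss for this triplet
print(loss.item())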
KTO
First understanding
Another type of preference dataset that can be used to train models to make decisions. The
model relies on simple binary preferences.
Second Search
Kahneman-Tversky Optimization
Aligns the model with human feedback. The method is inspired by principles of prospect
theory developed by Daniel Kahneman and Amos Tversky.
Dataset format:
Input: ""
Output: ""
Label: 1 or 0
Training process:
KTO uses a loss function that focuses on maximizing the likelihood of outputs with high
utility (label 1) while penalizing outputs with low utility (label 0).
It incorporates cognitive biases from prospect theory: it penalizes undesired outputs more
than it rewards desired outputs (loss aversion).
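A minimal Python sketch of this dataset format (the rows and field names are hypothetical; the exact names depend on the training library):

# Each example is a single (input, output) pair with a binary desirability label,
# rather than a chosen/rejected pair as in DPO.
kto_examples = [
    {"input": "Explain what a reward model is.",
     "output": "A reward model scores responses by how well they match human preferences.",
     "label": 1},   # desirable output
    {"input": "Explain what a reward model is.",
     "output": "It is a database index.",
     "label": 0},   # undesirable output
]

desirable = [ex for ex in kto_examples if ex["label"] == 1]
undesirable = [ex for ex in kto_examples if ex["label"] == 0]
print(len(desirable), "desirable,", len(undesirable), "undesirable")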
PPO
Proximal Policy Optimization is a reinforcement learning algorithm designed to optimize the
policy (the agent's strategy for choosing an action based on its current state) while ensuring
stable learning and efficient training.
PPO is part of the family of policy gradient methods, where the goal is to directly optimize the
policy rather than the value function.
What is a value function?
It estimates how good it is for the agent to be in a particular state (or to take an action in a
particular state) by predicting the expected cumulative future reward.
PPO by itself is an RL algorithm, so to fine-tune a model it needs a reward signal. To obtain the
rewards, we use Reward Modeling (RM) along with it.
Why with RM?
The flow is:
- First, SFT sets the base of the fine-tuning.
- Then reward modeling adds human-preference alignment.
- Then PPO fine-tunes the model, applying the reward model as the reward signal for the RL algorithm.
- PPO uses feedback from the reward model to optimize the policy.
In short, PPO is a reinforcement learning algorithm which requires a reward model to fine-tune
the model.
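The "stable learning" part of PPO comes from its clipped objective. Here is a minimal sketch on dummy values (assuming PyTorch; in RM+PPO fine-tuning the advantages would be derived from reward-model scores):

import torch

eps = 0.2  # clipping range

# Dummy log-probabilities of the sampled tokens under the new and old policy, plus advantages.
logp_new = torch.tensor([-1.0, -0.5, -2.0])
logp_old = torch.tensor([-1.2, -0.7, -1.5])
advantages = torch.tensor([0.8, 1.2, -0.5])

ratio = torch.exp(logp_new - logp_old)                       # probability ratio new/old
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages

# PPO maximizes the minimum of the two terms, so large policy jumps are not rewarded.
loss = -torch.min(unclipped, clipped).mean()
print(loss.item())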
Hyperparameters Selection
Concepts in Hyperparameters
Introduction
Hyperparameters are configuration variables that are set manually before training a model.
They are adjustable parameters that control the training process of a machine learning
model.
Why is it important?
Hyperparameters are essential when fine-tuning a model. They significantly influence the
performance, efficiency, and effectiveness of the fine-tuning process. They define key settings
such as aspects of the model architecture, the learning rate, and the model complexity. The
model's performance depends heavily on them.
If you fine-tune a model without specifying hyperparameters, the process will rely on default
settings provided by the framework or library you're using.
Most libraries have default values for hyperparameters like learning rate, batch size, number
of epochs, optimizer type, etc.
These defaults are general-purpose and not optimized for your specific dataset, model, or
task. While this can work in some cases, it often leads to suboptimal results or wasted
computational resources.
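For example, with the Hugging Face transformers library the common hyperparameters can be set explicitly instead of relying on defaults. The values below are purely illustrative, not recommendations:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine-tuned-model",   # where checkpoints are written
    learning_rate=2e-5,                # optimizer step size
    per_device_train_batch_size=8,     # batch size per device
    num_train_epochs=3,                # passes over the training data
    weight_decay=0.01,                 # regularization strength
    warmup_ratio=0.1,                  # fraction of steps used for learning-rate warmup
)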
Hyperparameter Tuning
It is the process of finding the configuration of hyperparameters that results in the best
performance.
Usually the libraries give you the list of hyperparameters to use when fine-tuning with a
specific technique, but there are also techniques through which good hyperparameter values
can be found.
Techniques:
1. GridSearchCV:
A brute-force approach to finding suitable hyperparameters. It fits the model using every possible
combination from a parameter grid and checks which combination gives the best results, as in the sketch below.
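A minimal GridSearchCV sketch with scikit-learn; the toy dataset, estimator, and parameter grid are chosen only for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Every combination of these values is tried with cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)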
Inference
Introduction
Inference is the process of running live data through a trained model to make a prediction or
solve a task. Up until now, the model was in the training/fine-tuning stage, where we were still
configuring it for our specific task. Inference is when we test the model's ability to generate
output on real, unseen data and obtain real-time outputs.
vLLM
vLLM is a library used for LLM inference and serving.
It is built to streamline the process of deploying and serving large models for inference,
focusing on optimizing performance in terms of speed and resource utilization. It is designed to
provide fast and efficient inference even for very large models.
Serving large models also consumes a massive amount of memory, but vLLM optimizes
memory usage so that models can be served efficiently on the available hardware.
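A minimal vLLM usage sketch; the model name is only an example and would normally be replaced by the fine-tuned checkpoint:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model; point this at your fine-tuned model

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of fine-tuning."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)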
Evaluation Metrics
Evaluation metrics are quantitative measures used to assess the performance of a machine
learning model. They help in determining how well a model is performing on a given task,
whether it's classification, regression, or another type of problem.
The metrics used in this platform are: Perplexity, ROUGE, and BLEU.
Perplexity
Perplexity is a metric used to measure how well a model predicts a sample of text. In simpler
terms, perplexity can be thought of as a measure of how "confused" or uncertain the model
is when predicting the next word or sequence of words in a sentence.
A low perplexity value indicates that the model is good at predicting the next word in a
sequence, meaning the model is confident and accurate in its predictions.
A high perplexity value suggests that the model is uncertain or less accurate in its
predictions.
Perplexity is calculated by measuring the model’s likelihood of predicting the actual
sequence of words in the test data.
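Written out, the standard formulation for a test sequence of N tokens w_1, ..., w_N is

\mathrm{Perplexity}(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i}) \right)

so a lower average negative log-likelihood on the test data means a lower perplexity.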
BLEU:
Bilingual Evaluation Understudy. Compares n-grams (sequences of words) in the generated text
with a reference text.
It captures surface-level similarity between the model output and the reference output; it does
not capture semantic meaning, only word-to-word overlap.
Better suited for translation tasks.
ROUGE:
Recall-Oriented Understudy for Gisting Evaluation.
Measures n-gram overlap, particularly recall, between the generated and reference text.
Because it focuses on recall, it checks whether the important parts of the reference text are
present in the generated text, even if the wording is not identical.
Better suited for summarization, to see whether the important information is preserved in the
summarized text.
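A minimal sketch of both metrics using the nltk and rouge_score packages (the sentences are made up; a real evaluation would use a full test set):

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
generated = "the cat sat on a mat"

# BLEU: n-gram precision of the generated text against the (tokenized) reference.
bleu = sentence_bleu([reference.split()], generated.split())

# ROUGE: recall-oriented n-gram overlap (ROUGE-1 and ROUGE-L shown here).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)

print("BLEU:", bleu)
print("ROUGE:", rouge)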
More Concepts
LoRA
LoRA (Low-Rank Adaptation) is a parameter-efficient technique used for fine-tuning a model.
One way to fine-tune is to train the full model, which is full-parameter fine-tuning. Another way
is domain fine-tuning, for a specific domain such as finance or education. Another is task-specific
fine-tuning, e.g. a Q&A chatbot or text-to-SQL. LoRA can be used in each of these settings, and it
also has variants of its own that apply the idea in different ways.
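LoRA works by freezing the base model and training small low-rank adapter matrices on top of it. A minimal configuration sketch with the peft library; gpt2 and the c_attn target module are just a small example, and the right target modules depend on the base model's architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small example base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layer in GPT-2
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable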
DoRA
Here the weights are still trained, but the weight-update matrix is decomposed into lower-rank
matrices: suppose a 3x3 matrix is decomposed into a 3x1 and a 1x3 matrix. So when we train the
model, the weight updates are stored as these smaller matrices, requiring far fewer parameters to
represent the same update. This eases resource constraints.
When the weights are decomposed this way, the number of trainable parameters drops a lot,
which also makes the model lighter to train.
Quantization
We can also convert the 32-bit weights into 8-bit or even 4-bit and then use the model.
This is what we call quantization. When we quantize the model, we can run inference more
quickly.
For fp32:
1 bit for the sign, 8 bits for the exponent, and the remaining 23 bits for the mantissa.
For fp16:
1 bit for the sign, 5 bits for the exponent, and the remaining 10 bits for the mantissa.
Suppose you want to convert a matrix whose values range from 0 to 1000 and are stored as
fp32, and you want to turn them into unsigned int8, i.e. into the 0-255 range.
Asymmetric Quantization
[-20.0 ... 1000.0] to [0 ... 255]
To do the conversion we use min-max scaling: scale = (1000 - (-20)) / (255 - 0) = 1020 / 255 = 4.0,
which is the scale factor. But when we convert -20, we get -20 / 4 = -5, and we cannot store -5
when our 8-bit range starts at 0. So we add +5 to -5 to make it 0; this +5, which maps the
minimum value to 0, is the zero point.
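The same worked example in code, as a NumPy sketch of asymmetric min-max quantization:

import numpy as np

x = np.array([-20.0, 0.0, 500.0, 1000.0], dtype=np.float32)

# Map the float range [-20, 1000] onto the unsigned 8-bit range [0, 255].
x_min, x_max = -20.0, 1000.0
q_min, q_max = 0, 255

scale = (x_max - x_min) / (q_max - q_min)   # (1000 - (-20)) / 255 = 4.0
zero_point = round(q_min - x_min / scale)   # -(-20 / 4.0) = 5

q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.uint8)
x_dequant = (q.astype(np.float32) - zero_point) * scale

print("scale:", scale, "zero point:", zero_point)
print("quantized:", q)            # [  0   5 130 255]
print("dequantized:", x_dequant)  # [ -20.    0.  500. 1000.]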
Modes of Quantization.
1. Post-Training Quantization (PTQ)
We take a pre-trained model --> perform calibration --> obtain a quantized model --> use it for
any use case.
This may cause some loss of accuracy.
QLoRA
Thus we can quantize the model used with LoRA as well, to lower precision such as 4-bit or 8-bit
from higher precision. This enables us to fine-tune and use LLMs with less GPU power, for example
on Colab.
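A minimal QLoRA-style loading sketch with transformers, bitsandbytes, and peft. The model name is only an example, and this assumes a CUDA GPU with bitsandbytes installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model with its weights quantized to 4-bit (NF4).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example model; replace with the model you fine-tune
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small LoRA adapters on top of the frozen 4-bit weights.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()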