Parameter Efficient Fine Tuning
https://medium.com/@lmpo/parameter-efficient-fine-tuning-of-large-language-models-4ed51860e1da
Content
• Review of Finetuning Approaches
• Parameter Efficient Finetuning (PEFT)
§ Adapter Layer
§ LoRA (Low Rank Adaptation)
§ QLoRA (Quantized LoRA)
§ DoRA (Optional)
§ MoRA (Optional)
§ SBoRA and Multi-SBoRA (Optional)
Transfer Learning and Finetuning
• Transfer learning and fine-tuning are two key techniques in machine learning that leverage
pre-trained models to improve performance on new tasks.
• While they are often used interchangeably, they have distinct methodologies and applications.
• This approach is based on the idea that a model trained on one task can be adapted to perform
well on another related task.
(Figure: transfer learning of a pretrained model vs. fine-tuning an LLM.)
Finetuning of Convolutional Neural Networks (CNNs)
• In computer vision application with CNNs, finetuning is often applied to pre-trained
models like ResNet and EfficientNet, which were initially trained using supervised
learning on large, labeled datasets.
• There are 3 popular finetuning approaches:
(1) Full Finetuning (2) Feature-based Approach (3) Top-Layer Finetuning
(Figure: the three approaches, (1) update all layers, (2) update the classifier only, (3) update a few of the top layers.)
Finetuning of Large Language Models
• In Natural Language Processing (NLP), finetuning is commonly applied to
pretrained large language models (LLMs) like BERT, GPT-3, which are initially
trained using self-supervised learning on large-scale unlabeled corpora, with
labels generated automatically from the data itself.
LLM Finetuning: From General to Specific
• In the realm of NLP, finetuning of LLMs is a crucial step in transforming
general-purpose pretrained models into specialized models tailored to meet
the unique demands of specific applications.
• This process effectively bridges the gap between the generic, pretrained
models and the nuanced requirements of a particular task or domain.
§ For example, finetuning a pretrained GPT-3 model on a dataset of medical
reports and patient notes enables it to adapt to complex medical terminology
and jargon, significantly enhancing its performance in generating accurate
patient reports.
§ This targeted finetuning unlocks the full potential of LLMs in specialized
applications.
LLM Training Pipelines
• Step 1: Pre-training (Self-Supervised Training)
• Step 2: Finetuning (Supervised Finetuning)
• Step 3: Human Alignment (RLHF with PPO or DPO)
Parameter-Efficient Finetuning
(PEFT)
LLMs are Becoming Very Large Indeed
The size of LLMs has been rapidly increasing, with models like GPT-3 having 175 billion parameters,
and recent models like Google's PaLM surpassing the trillion-parameter mark, enabling more
capable but also more resource-intensive models.
Naively Fine-Tuning LLaMA-3-8B takes 128GB of RAM!
• Fine-tuning small models like LLaMA3 8B on regular consumer GPUs can be
challenging due to the significant memory requirements:
1. Memory Requirements: LLaMA3 8B has 8 billion parameters and if it’s loaded in full-precision
(float32 format-> 4 bytes/parameter), then the total memory requirements for loading the model
would be numberOfParams*bytesPerParam = 8 billion*4 = 32GB of memory.
• Given that many consumer GPUs/ free versions of software like Google Colab have memory constraints
(e.g., NVIDIA T4 16GB on Google Colab), the model cannot even be loaded!
2. Fine-tuning memory requirements: In the case of full fine-tuning with the regular Adam
optimizer using a half-precision model (2 bytes/param), we need to allocate per parameter: 2
bytes for the weight, 2 bytes for the gradient, and 12 bytes for the Adam optimizer states. This
results in a total of 16 bytes per trainable parameter, requiring over 120GB of GPU memory!
• This would require at least 3× A40 GPUs with 48GB of VRAM each, which would mean fine-tuning
wouldn't be accessible to the public.
https://medium.com/polo-club-of-data-science/memory-requirements-for-fine-tuning-llama-2-80f366cba7f5
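As a sanity check on these numbers, here is a small back-of-the-envelope calculation in Python (it simply applies the 16 bytes/trainable-parameter estimate above; exact figures depend on the framework and optimizer):

```python
# Rough memory estimates for LLaMA-3-8B (illustrative only).
n_params = 8e9

# Loading the model in full precision (float32, 4 bytes/parameter).
load_fp32_gb = n_params * 4 / 1e9           # ~32 GB

# Full fine-tuning with Adam on a half-precision model:
# 2 B weight + 2 B gradient + 12 B optimizer states = 16 B per trainable parameter.
full_ft_gb = n_params * (2 + 2 + 12) / 1e9  # ~128 GB

print(f"load fp32: {load_fp32_gb:.0f} GB, full fine-tune: {full_ft_gb:.0f} GB")
```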
Full Fine-Tuning (FFT)
• Full Fine-Tuning updates all pre-trained model parameters to adapt to a
new task, typically leading to improved performance. However, FFT has two main drawbacks:
1. Computational expense: Even fine-tuning small models like LLaMA3 with 8B
parameters can be resource-intensive.
2. Storage costs: Saving entire models for each checkpoint (finetuned model) can also
be storage-prohibitive.
(Figure: FFT needs a separate copy of the full finetuned 8B model for each downstream task.)
Parameter-Efficient Finetuning (PEFT)
• To address these FFT limitations, PEFT methods were developed,
adapting only a small subset of a model's parameters to a new task.
• The importance of PEFT in practical LLM applications lies in its ability to:
1. Lower hardware requirements and reduce memory needs
2. Speed up training times and reduce GPU usage
3. Improve modeling performance by reducing overfitting
4. Minimize storage needs by sharing weights across tasks
Reduce the Number of Parameters by PEFT
• PEFT enables the reuse of pretrained model weights, requiring only a
small number of additional task-specific parameters.
• Example: adapting a LLaMA-3-8B model to four tasks with PEFT requires the shared 8B-parameter base model plus four sets of task-specific parameters of roughly 80M (0.08B) each:
  8B + 4 × 0.08B = 8.32B parameters (≈33.28GB of storage),
  which is only about 26% of the storage needed to keep four fully finetuned 8B models.
PEFT Techniques
1. Adapter Tuning (2019) – Add new intermediate modules
2. Prefix Tuning (2021) – Add additional prefixes
3. Prompt Tuning (2021) – Adapts input prompts
4. LoRA (2021) – Low-Rank decomposition
5. QLoRA (2023) – Quantized LoRA
6. DoRA (2024) – Weight-Decomposed LoRA
7. MoRA (2024) – High-Rank Updating
All tasks share the same original PLM; the adapters are task-specific modules => better robustness, storage-efficient
https://www.youtube.com/watch?v=R3jZVKUlSjA
Adapter Performance on Finetuning BERT
1. Adapter-trained BERT models achieve
similar performance to fully finetuned
ones while training only 3.6% of the
parameters. This suggests significant
parameter efficiency.
2. Adapters outperform top-layer finetuning
using even fewer parameters. This implies
higher efficiency than training just the
output layers of BERT.
Other Types of Adapter
• https://adapterhub.ml/
https://arxiv.org/abs/2210.06175
Low-Rank Adaptation (LoRA)
https://arxiv.org/pdf/2106.09685.pdf
LoRA (Edward Hu et al., 2021-06)
• Low-Rank Adaptation (LoRA) is a groundbreaking PEFT technique for LLMs.
• It introduces a parallel low-rank adapter alongside the weights of linear layers,
reducing memory overhead and computational costs during finetuning.
https://arxiv.org/abs/2106.09685
Motivation of LoRA
• Core Finding: Low Intrinsic Dimensionality in Language Models
• Significance of Intrinsic Dimensionality:
§ Intrinsic dimensionality is a crucial metric that explains why large
language models are efficiently fine-tunable with limited data.
• Broader Impact:
§ Understanding intrinsic dimensionality could lead to more resource-efficient and
effective ways to train and deploy language models.
What is LoRA?
• LoRA – Low-Rank Adaptation
§ Low-rank: the rank $r$ of the matrix is smaller than the matrix dimension $d$
§ Rank: the number of linearly independent rows/columns
§ Adaptation: fine-tuning of models
• Examples with $d = 3$:
  Rank-1 matrix: $\mathrm{rank}\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}=1$
  Rank-2 matrix: $\mathrm{rank}\begin{pmatrix}2&10&1\\7&20&2\\6&30&3\end{pmatrix}=2$
  Rank-3 matrix: $\mathrm{rank}\begin{pmatrix}2&3&1\\7&5&2\\6&1&3\end{pmatrix}=3$
Recap: Matrix Rank
• The rank of a matrix $\mathbf{A} \in \mathbb{R}^{d\times k}$ is the number of linearly independent
columns (or rows), and always satisfies:
  $\mathrm{rank}(\mathbf{A}) \le \min(d, k)$
• A matrix with $\mathrm{rank}(\mathbf{A}) = \min(d, k)$ is called a Full-Rank Matrix. For example, the following
3×3 matrix has a rank of 3, making it a full-rank matrix:
  $\mathrm{rank}\begin{pmatrix}2&3&1\\7&5&2\\6&1&3\end{pmatrix}=3$
• A matrix with $\mathrm{rank}(\mathbf{A}) < \min(d, k)$ is called a Low-Rank Matrix. Here are two examples of
low-rank matrices:
  $\mathrm{rank}\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}=1, \qquad \mathrm{rank}\begin{pmatrix}2&10&1\\7&20&2\\6&30&3\end{pmatrix}=2$
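These ranks can be checked quickly with NumPy (a small illustration, not part of the original slides):

```python
import numpy as np

full_rank = np.array([[2, 3, 1], [7, 5, 2], [6, 1, 3]])
rank1     = np.array([[2, 10, 1], [4, 20, 2], [6, 30, 3]])
rank2     = np.array([[2, 10, 1], [7, 20, 2], [6, 30, 3]])

print(np.linalg.matrix_rank(full_rank))  # 3 (full rank)
print(np.linalg.matrix_rank(rank1))      # 1 (low rank)
print(np.linalg.matrix_rank(rank2))      # 2 (low rank)
```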
Recap: Rank Matrix Decomposition
• Low-rank matrices can be decomposed into low-dimensional matrices.
• A low-rank matrix can be decomposed into the product of two low-dimensional matrices.
For instance, a rank-1 3×3 matrix can be decomposed as follows:
  $\underbrace{\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}}_{3\times3} = \underbrace{\begin{pmatrix}1\\2\\3\end{pmatrix}}_{3\times1}\times\underbrace{\begin{pmatrix}2&10&1\end{pmatrix}}_{1\times3}$
§ This decomposition reduces the number of coefficients needed to represent the matrix from 9 to
6 = (3×1 + 1×3).
• In general, a rank-$r$ $n\times n$ matrix can be decomposed into an $n\times r$ matrix and an $r\times n$ matrix.
• For $r = 1$, the reduction in coefficients is $n^2 \Rightarrow 2n$.
• The general reduction in coefficients for a rank-$r$ matrix is $n^2 \Rightarrow 2rn$.
Recap: Matrix Representation of FFN
• Formulation of Hidden Layer 1: $\mathbf{a}^{(1)} = g(\mathbf{z}^{(1)}) = g(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})$
(Figure: a fully-connected layer mapping the input $\mathbf{x} = (x_1, \dots, x_4)^{T}$ to the hidden activations $\mathbf{a}^{(1)}$ through the weight matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{4\times4}$ and bias $\mathbf{b}^{(1)} \in \mathbb{R}^{4}$.)
• Common pre-trained models have been empirically shown to have a very low intrinsic
dimension (low rank).
• In other words, there exists a low-dimensional reparameterization that is as effective for
finetuning as the full parameter space.
https://arxiv.org/pdf/2012.13255.pdf
LoRA (Low-Rank Adaptation)
• Use low-rank submodules to modify the hidden representations:
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
§ Pretrained weights: $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ (frozen)
§ Introduce two new smaller matrices $\mathbf{A} \in \mathbb{R}^{r\times k}$ and $\mathbf{B} \in \mathbb{R}^{d\times r}$
  $\mathbf{h} = (\mathbf{W}_0 + \Delta\mathbf{W})\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$,
  where $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$ is a low-rank matrix.
https://arxiv.org/abs/2106.09685
LoRA (Low-Rank Adaptation)
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
(Figure: the frozen pretrained weights $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ sit in parallel with the LoRA branch; $\mathbf{A} \in \mathbb{R}^{r\times k}$ is initialized from $\mathcal{N}(0, \sigma^2)$ and $\mathbf{B} \in \mathbb{R}^{d\times r}$ is initialized to zero, so $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \mathbf{0}$ at the start of training.)
LoRA: $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$
• Example with $d = k = 4$:
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \Delta\mathbf{W}\mathbf{x}$, where $\mathbf{W}_0 = (w_{ij}) \in \mathbb{R}^{4\times4}$ is the frozen pretrained weight matrix and the low-rank update is, e.g.,
  $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \begin{pmatrix}0.15&-0.14&-0.21&0.612\\-0.22&0.204&0.308&-0.86\\-0.30&-0.16&0.634&0.147\\-0.07&-0.2&0.246&0.523\end{pmatrix}$
LoRA High Parameter Efficiency Example
• Only a small fraction of the weights is trained, giving (a) a lower memory footprint and (b) faster finetuning jobs.
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
§ W.shape == (1024, 1024): W parameter count = 1024² = 1,048,576 (frozen)
§ B.shape == (1024, 8): B parameter count = 1024 × 8 = 8,192
§ A.shape == (8, 1024): A parameter count = 8 × 1024 = 8,192
§ Trainable parameters: 8,192 + 8,192 = 16,384, about 1.6% of the frozen weights.
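A minimal PyTorch sketch of such a LoRA-augmented linear layer (an illustrative implementation, not the official one; the names `LoRALinear`, `rank`, and `alpha` are ours, and the α/r scaling is introduced on a later slide):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)                # pretrained weights W0 (d x k)
        self.W0.weight.requires_grad_(False)                  # frozen
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)    # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d, rank))           # B = 0, so dW = BA = 0 at start
        self.scale = alpha / rank                             # scaling factor alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.W0(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(d=1024, k=1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 = 1024*8 + 8*1024, ~1.6% of the 1,048,576 frozen weights
```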
Applying LoRA to Feed-Forward Networks
• LoRA can be applied to each fully-connected layer, and the classifier heads
are the task-specific modules.
(Figure: a 12-layer Transformer stack on top of the embedding layer, with LoRA branches (A, B) added to the feed-forward down-projection and up-projection of each Transformer layer, and a task-specific classifier head on top.)
Applying LoRA to Transformers
(Figure: LoRA adapters attached to the query (Q), key (K), and value (V) projection matrices of the self-attention blocks.)
Which Weight Matrices?
(Table from the LoRA paper comparing which attention weight matrices to adapt under a fixed parameter budget.)
https://arxiv.org/pdf/2106.09685.pdf
Scaling Factor 𝛼
• The scaling factor $\alpha$ is used to scale the output of the matrices $\mathbf{B}$ and $\mathbf{A}$:
  $\mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \dfrac{\alpha}{r}\,\mathbf{B}\mathbf{A}$
• It is divided by the rank 𝑟, which represents the intrinsic dimension and determines the
level of decomposition or compression applied to the weights.
• Typically, the rank ranges from 1 to 64, while the scaling factor 𝛼 controls the amount of
change applied to the original model weights, striking a balance between the knowledge of
the pre-trained model and its adaptation to a new task.
• Both the 𝛼 and 𝑟 are hyperparameters, which need to be tuned.
§ Basically, the scaling factor 𝛼 helps in stabilizing other hyperparameters, such as learning rates, when
the rank is varied. By adjusting the rank and incorporating the scaling factor, one can explore different
levels of decomposition without needing to extensively tweak other parameters. This approach
simplifies the process of finding the optimal level of decomposition for a given task.
How Low-Rank can LoRA go?
• LoRA works even with extremely small values of r, such as 4, 2, or even 1.
• On the WikiSQL and MultiNLI datasets, the authors found no
statistically significant difference in performance when reducing the rank from r = 64
to r = 1.
https://arxiv.org/pdf/2106.09685.pdf
LoRA Performance Parity with Fully Finetuned LLMs
Extremely Parameter Efficient Finetuning
The number of trainable parameters is less than 1% of the total model size.
LoRA Comparison on GPT-3
LoRA reduces the number of trainable parameters in GPT-3 by 5 orders of magnitude!
https://arxiv.org/pdf/2106.09685.pdf
Hugging Face PEFT
https://huggingface.co/docs/peft/en/index
We don't have to manually apply a low-rank decomposition to each layer individually. Instead, we can
use the `get_peft_model` function, which takes care of this process for us.
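For example, a typical usage sketch with the Hugging Face peft library (the model name and target module names are illustrative; check the library documentation for the exact arguments of your installed version):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which linear layers get LoRA adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)    # wraps the listed layers with LoRA
model.print_trainable_parameters()        # prints trainable vs. total parameter counts
```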
Benefits of LoRA
• Less Parameters: LoRA reduces computational requirements during training,
leading to faster training and lower memory usage.
• Flexibility: Switch between different LoRA weights
• Seamless Integration: The ∆𝐖 (𝐁𝐀) weights from the rank decomposition can
be merged with the original model weights by simply adding them together,
without introducing any overhead during inference.
  During training: $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$
  After training (merged weights): $\mathbf{W}' = \mathbf{W}_0 + \mathbf{B}\mathbf{A} \in \mathbb{R}^{d\times k}$, so inference is simply $\mathbf{h} = \mathbf{W}'\mathbf{x}$.
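A small NumPy sketch of this merge step (assuming the α/r scaling from the earlier slide; after merging, inference is a single matrix multiply):

```python
import numpy as np

d, k, r, alpha = 1024, 1024, 8, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                  # frozen pretrained weights
B  = rng.normal(scale=0.01, size=(d, r))      # trained LoRA factors (nonzero after training)
A  = rng.normal(scale=0.01, size=(r, k))

W_merged = W0 + (alpha / r) * (B @ A)         # W' = W0 + (alpha/r) BA

x = rng.normal(size=(k,))
h_two_branch = W0 @ x + (alpha / r) * (B @ (A @ x))
h_merged     = W_merged @ x
print(np.allclose(h_two_branch, h_merged))    # True: no extra inference overhead
```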
LoRA: A New Paradigm Shift in NLP
• LoRA enables us to adapt pretrained LLMs to specific downstream
tasks faster, more robustly, and with orders of magnitude fewer
learnable parameters compared to standard fine-tuning.
• LoRA's success suggests low-rank, coarse-grained weight updates
during fine-tuning, akin to "remembering" over "learning".
• Lowest possible rank depends on downstream task difficulty relative
to pre-training.
• Lower ranks expected in earlier Transformer layers, higher ranks in
later layers.
QLoRA
https://arxiv.org/pdf/2305.14314.pdf
QLoRA: LoRA with 4-bit Quantization (2023-05)
• In LoRA, the pretrained weights $\mathbf{W}_0$ still account for most of the memory footprint.
§ LLaMA-3-70B model with 32-bit precision requires 820GB of GPU memory.
https://arxiv.org/pdf/2305.14314.pdf
Innovations of QLoRA
1. NF4 (4-bit NormalFloat):
• A specialized 4-bit floating-point format that normalizes weight values to the
range [-1, 1] before quantization, allowing for a more accurate representation of
the weight distribution and outperforming other 4-bit quantization techniques.
2. Double Quantization (DQ):
• A nested quantization technique that combines NF4 with further compression of
quantization constants to an 8-bit format, resulting in significant memory savings
(around 3GB for massive models like LLaMA-65B).
3. Paged Optimizers:
• A technique that pages optimizer states between CPU and GPU memory on demand,
preventing out-of-memory spikes during training (described in more detail later).
Block-wise k-bit Quantization
• Quantization is the process of discretizing an input from a representation that holds more
information to a representation with less information.
• It often means taking a data type with more bits and converting it to fewer bits, for example from 32-
bit floats to 8-bit Integers.
• To ensure that the entire range of the low-bit data type is used, the input data type is commonly
rescaled into the target data type range through normalization by the absolute maximum of the
input elements, which are usually structured as a tensor.
• For example, quantizing a 32-bit Floating Point (FP32) tensor into an Int8 tensor with range [-127, 127]:
  $\mathbf{X}^{\text{Int8}} = \mathrm{round}\!\left(\dfrac{127}{\mathrm{absmax}(\mathbf{X}^{\text{FP32}})}\,\mathbf{X}^{\text{FP32}}\right) = \mathrm{round}\!\left(c^{\text{FP32}}\cdot\mathbf{X}^{\text{FP32}}\right)$
  where $c^{\text{FP32}} = \dfrac{127}{\mathrm{absmax}(\mathbf{X}^{\text{FP32}})}$ is the quantization constant (quantization scale).
• Dequantization is the inverse:
  $\mathrm{dequant}\!\left(c^{\text{FP32}}, \mathbf{X}^{\text{Int8}}\right) = \dfrac{\mathbf{X}^{\text{Int8}}}{c^{\text{FP32}}} \approx \mathbf{X}^{\text{FP32}}$
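A minimal NumPy illustration of absmax Int8 quantization and dequantization (per-tensor here; block-wise quantization, discussed below, applies the same idea to blocks of e.g. 64 values, each with its own constant c):

```python
import numpy as np

x_fp32 = np.array([0.1, -0.5, 1.9, -2.4, 0.0], dtype=np.float32)

c = 127.0 / np.abs(x_fp32).max()                  # quantization constant (scale)
x_int8 = np.round(c * x_fp32).astype(np.int8)     # quantize: round(c * X)

x_deq = x_int8.astype(np.float32) / c             # dequantize: X_int8 / c
print(x_int8)   # [   5  -26  101 -127    0]
print(x_deq)    # approximately the original values
```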
QLoRA
• Using the components described above, QLoRA defines a single linear layer in the
quantized base model with a single LoRA adapter as (in simplified notation):
  $\mathbf{h} = \mathrm{dequant}\!\left(c, \mathbf{W}_0^{\text{NF4}}\right)\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
  where the 4-bit base weights $\mathbf{W}_0^{\text{NF4}}$ are dequantized on the fly to the compute data type (e.g. BF16) for the forward pass, while $\mathbf{A}$ and $\mathbf{B}$ are stored and trained in 16-bit precision.
4-bit NormalFloat (NF4)
• According to the QLoRA paper, pre-trained parameters are generally in
accordance with a zero-centered normal distribution with a standard
deviation of σ. We can scale σ to transform all weights into a single fixed
distribution that fully adapts to the data range specified by QLoRA.
• Motivated by this, QLoRA calculates the values of qj based on the
quantiles of the normal distribution.
• The current problem is how to calculate the 16 quantiles:
  $q_1, \dots, q_{16} \in [-1, 1]$
4-bit NormalFloat (NF4)
• NF4 is an information-theoretically optimal data type for normally distributed weights.
https://ai.plainenglish.io/qlora-key-quantization-and-fine-tuning-techniques-in-the-era-of-large-language-models-0fa05a961d27
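A simplified illustration of the quantile idea behind NF4 (this is not the exact NF4 construction from the paper, which additionally splits the range asymmetrically so that zero is represented exactly; the probability offsets chosen here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# 16 evenly spaced probabilities (avoiding 0 and 1), mapped through the
# standard-normal quantile function, then normalized into [-1, 1].
p = np.linspace(0.03, 0.97, 16)      # offset keeps the quantiles finite
q = norm.ppf(p)
q = q / np.abs(q).max()              # 16 levels in [-1, 1], denser near 0 where most weights lie
print(np.round(q, 3))
```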
Double Quantization
• Block-wise Quantization
§ We know that the essence of quantization is to map values from a larger range to
a smaller range. We can use a constant c to proportionally reduce the values. In
this way, we can easily use the same constant c to dequantize the quantized
values back to their original (approximate) form.
§ However, if our data contains outliers, this will affect the selection of c and cause
the other values to collapse into a small range. Block-wise quantization provides a solution:
it quantizes one block at a time, with each block using its own independent
quantization constant c.
• Since quantization constants are typically stored as FP32, the memory usage
can become significant when there are a large number of blocks.
The Approach of QLoRA
• QLoRA divides the parameters into blocks of size 64.
§ Each block calculates a quantization constant, denoted as c.
• QLoRA further quantizes the quantization constants into FP8 using Double
Quant, with a block size of 256.
• This further reduces the memory consumption.
• Before Double Quant:
• Quantizing each parameter requires an additional 32/64 = 0.5 bits of memory.
• After Double Quant:
• Quantizing each parameter only requires an additional 8/64 + 32 / (64*256) = 0.127 bits of
memory.
Double Quantization: Reduce absmax constant size
Paged Optimizer: Prevent Memory Spikes
• Page-by-page transfers of memory from CPU <=> GPU as needed
§ Lazy and does not need to be managed (no offloading, everything is automatic).
• The Paged Optimizer mechanism allows optimizer states to be transferred to CPU memory
when GPU memory runs low.
§ They are loaded back when the optimizer state needs to be updated.
§ This effectively reduces the peak occupancy of GPU memory.
• The QLoRA paper states that this mechanism is necessary to train a model
with 33 billion parameters on a 24GB GPU.
• This mechanism can be easily configured by setting the parameters of TrainingArguments:
§ optim = ‘paged_adamw_32bit’
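A hedged configuration sketch using the Hugging Face transformers + bitsandbytes integration (argument names reflect current library versions and should be verified against your installation; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,        # double quantization of the constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_cfg
)

args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_32bit",             # paged optimizer to absorb memory spikes
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```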
Paged Optimizer: Prevent Memory Spikes
• Paged Optimizer works like this:
1. A large mini-batch (long sequence length) uses more GPU memory than
available
2. The paging engine evicts optimizer states to CPU memory
3. During the optimizer step, all optimizer states are prefetched back to the GPU
4. Do an optimizer step
5. Continue to process everything on the GPU as long as the mini-batch does not
cause an eviction
How does QLoRA reduce memory to 14GB?
• Below is the calculation to determine the memory requirements for fine-tuning LLaMA3–8B with
QLoRA.
§ Memory requirement for loading the 4-bit quantized model:
• The LLaMA3-8B base model has about 8 billion parameters, and each parameter is quantized to 4 bits
(0.5 bytes). Hence, loading the model would take about 4GB ( 8 billion parameters × 0.5 bytes).
§ Memory requirement per trainable parameter consists of:
• Weight: 0.5 bytes
• LoRA parameters: 2 bytes
• AdamW optimizer states: 2 bytes
• Gradients (always in fp32): 4 bytes
• Therefore, the memory per trainable parameter is 8.5 bytes ( ≈ 0.5 + 2 + 2 + 4)
§ Total memory requirement for trainable parameters:
• LoRA typically results in 0.4-0.7% trainable parameters; assume 0.6% trainable parameters here
• The total trainable parameters memory :
Memory per parameter * parameters = 8.5 bytes * 48 million (0.6% of 8B parameters) ≈ 0.408 GB
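The same arithmetic as a small Python check (the 0.6% trainable-parameter figure is the assumption stated above):

```python
n_params = 8e9

base_4bit_gb = n_params * 0.5 / 1e9                   # 4-bit weights: ~4 GB

trainable = 0.006 * n_params                           # ~48 million LoRA parameters
bytes_per_trainable = 0.5 + 2 + 2 + 4                  # weight + LoRA + optimizer states + gradient
trainable_gb = trainable * bytes_per_trainable / 1e9   # ~0.41 GB

print(f"base: {base_4bit_gb:.1f} GB, trainable: {trainable_gb:.2f} GB")
```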
How does QLoRA reduce memory to 14GB?
• Total memory requirement for LLaMA3-8B QLoRA Training: The total memory requirement for
QLoRA training is around 4.1GB, which includes the memory for the base model and the memory for
trainable parameters ≈ 0.408 GB, resulting in a total training memory requirement of about ≈ 4–5
GB (depending on the number of trainable parameters).
• Memory required for inference: If we load the base model in 16-bit precision and merge the LoRA
weights of the fine-tuned model, we would use at most 14 GB of GPU memory for a sequence
length of 2048. This memory cost is derived from loading the model in float16 precision and includes
activations, temporary variables and hidden states, which are always in full-precision (float32) format
and depend on many factors including sequence length, hidden size and batch size.
• Total memory requirements: So, the total memory requirement for QLoRA training with a 4-bit base
model in mixed-precision mode, including loading the 16-bit model for inference, would be
almost ≈ 14 GB, depending on the sequence length.
• Thus, we can see that using quantization techniques like QLoRA along with PEFT can significantly
reduce memory requirements by up to 90%, thereby making fine tuning more accessible and
affordable!
Large Models Are Not Easily Accessible
https://www.youtube.com/watch?v=fQirE9N5q_Y
QLoRA
• QLoRA hyperparameter settings:
§ Alpha determines the multiplier applied to the weight changes when they are added to the original
weights
• Scale multiplier = Alpha / Rank
• The Microsoft LoRA repository sets alpha to 2 × Rank
• The QLoRA paper went with ¼ of the Rank (alpha = 16, r = 64)
§ Dropout is a percentage that randomly leaves out some weight changes each time to deter
overfitting
• The QLoRA paper went with 0.1 for 7B-13B models and 0.05 for 33B-65B models
• The QLoRA paper has two interesting findings:
§ Training all layers of the network is necessary to match the performance of full-parameter fine-tuning
§ The rank may not matter much from 8 to 256
https://www.youtube.com/watch?v=t1caDsMzWBk
QLoRA Summary
• QLoRA uses NF4, double quantization, and paged optimizers combined
with LoRA to replicate 16-bit full finetuning performance at a 17x
smaller memory footprint.
• While evaluation is noisy, Guanaco models outperform existing open-
source models on the Vicuna benchmark.
Optional Content
§ DoRA (Optional)
§ MoRA (Optional)
§ SBoRA (Optional)
§ Multi-SBoRA (Optional)
https://www.youtube.com/watch?v=WLDehSkSIhY
DoRA (Wang et al., 2024-02)
Weight-Decomposed Low-Rank Adaptation
• In DoRA, the original pretrained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ is decomposed into a
Magnitude vector $\mathbf{m}$ and a Direction matrix $\mathbf{V}$ prior to applying LoRA.
§ $\mathbf{m} \in \mathbb{R}^{1\times k}$ is the magnitude vector (trainable), initialized to the column-wise norm $\|\mathbf{W}_0\|_c$.
§ The direction matrix is initialized as $\mathbf{V}_0 = \mathbf{W}_0$ (frozen).
• The magnitude vector $\mathbf{m}$ is small and is trained normally.
• The direction matrix $\mathbf{V}$ is big, so LoRA ($\Delta\mathbf{V} = \mathbf{B}\mathbf{A}$, with frozen $\mathbf{V}_0$) is used to fine-tune it.
• After finetuning, the magnitude and the updated direction are merged back into a single weight matrix $\mathbf{W} \in \mathbb{R}^{d\times k}$.
DoRA Methodology
• DoRA optimizes both the magnitudes and directions of the pre-trained weights. Since the directional
component is large in terms of the number of parameters, it is further decomposed with LoRA:
  $\mathbf{W} = \mathbf{m}\,\dfrac{\mathbf{V}}{\|\mathbf{V}\|_c} = \mathbf{m}\,\dfrac{\mathbf{V}_0 + \Delta\mathbf{V}}{\|\mathbf{V}_0 + \Delta\mathbf{V}\|_c} = \mathbf{m}\,\dfrac{\mathbf{W}_0 + \mathbf{B}\mathbf{A}}{\|\mathbf{W}_0 + \mathbf{B}\mathbf{A}\|_c}$
§ $\|\mathbf{V}_0 + \Delta\mathbf{V}\|_c$ is just a normalization term that DoRA treats as a constant; it does not receive gradients
during backpropagation.
• LoRA is applied to the transformer's query and value matrices, and the magnitude and directional
differences between the original and finetuned weight matrices are calculated.
• For inference, the magnitude vectors and updated direction matrices can be combined back into
updated weights for the original model.
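A compact NumPy sketch of this decomposition (column-wise norms; a simplified illustration of the formula above, not the authors' code):

```python
import numpy as np

d, k, r = 16, 32, 4
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                            # frozen pretrained weights
m  = np.linalg.norm(W0, axis=0, keepdims=True)          # magnitude vector (1 x k), trainable
B  = np.zeros((d, r))                                   # LoRA factors for the direction update
A  = rng.normal(scale=0.01, size=(r, k))

V = W0 + B @ A                                          # updated direction V0 + dV
W = m * (V / np.linalg.norm(V, axis=0, keepdims=True))  # W = m * V / ||V||_c
print(W.shape)  # (16, 32): merged weight for inference; equals W0 while B = 0
```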
Analysis of Parameter Update Correlations
• To evaluate the parameter update correlations of full finetuning (FT), LoRA and
DoRA, the authors calculated, at four training-step checkpoints t:
§ Magnitude change: $\Delta\mathbf{M}^{t} = \dfrac{\sum_{i=1}^{k}\left|\mathbf{m}^{t}_{i} - \mathbf{m}^{0}_{i}\right|}{k}$
§ Directional change (via cosine similarity): $\Delta\mathbf{D}^{t} = \dfrac{\sum_{i=1}^{k}\left(1 - \cos(\mathbf{V}^{t}_{i}, \mathbf{W}^{0}_{i})\right)}{k}$
  where t is the number of training steps and i indexes the columns of the weight matrix.
Parameter Update Correlations: FFT vs LoRA
• DoRA's authors found that, for full finetuning (FFT), the changes to magnitude ∆𝐌 and direction ∆𝐃
are largely independent (a small negative correlation), and they spread with a high variance.
• LoRA, however, produces highly positively correlated changes (a positive slope) with a
significantly lower variance, which hurts performance:
§ LoRA cannot make slight directional changes alongside significant magnitude alterations.
(Figure: scatter plots of ∆𝐃 vs. ∆𝐌 at the four checkpoints for FFT and LoRA.)
MoRA: High-Rank Updating for PEFT
• MoRA addresses LoRA's limitations in knowledge enhancement, demonstrating a
superior ability to memorize new information.
MoRA: High-Rank Updating for PEFT
• MoRA's innovation is its use of a compress function to project input vectors into a
lower-dimensional space, perform high-rank transformations using a smaller square matrix,
and then apply a decompress function to project the result back into the original
higher-dimensional space.
(Figure: LoRA (r = 8) with frozen $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ and trainable $\mathbf{B}$, $\mathbf{A}$, next to MoRA (r = 256) with frozen $\mathbf{W}_0$ and a trainable square matrix $\mathbf{M} \in \mathbb{R}^{r\times r}$ placed between the compress and decompress operators; both map $\mathbf{x} \in \mathbb{R}^{k\times 1}$ to $\mathbf{h} \in \mathbb{R}^{d\times 1}$.)
MoRA: High-Rank Updating for PEFT
• MoRA uses compress and decompress functions to project input vectors into a
lower-dimensional space, perform high-rank transformations with a smaller square matrix $\mathbf{M}$,
and then project back to the original space:
  $\Delta\mathbf{W}\mathbf{x} = f_{\text{decomp}}\!\left(\mathbf{M}\, f_{\text{comp}}(\mathbf{x})\right)$
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + f_{\text{decomp}}\!\left(\mathbf{M}\, f_{\text{comp}}(\mathbf{x})\right)$
• This approach, using non-parameterized operators, enables high-rank updates without
increasing the number of trainable parameters, differing from LoRA.
Parameter Efficiency of MoRA
• For example, it is given that the weight layer is 4096 x 4096, which means 16,777,216
parameters would need to be updated with FFT.
• If r = 8 is chosen with LoRA, it would result in 2 x (4096 x 8) = 65,536 parameters being updated.
• With MoRA, if a 256 x 256 matrix is chosen for M, it would mean 256 x 256 = 65,536 parameters
would need to be tuned, the same number as LoRA.
Non-Parameterized Compress and Decompress
• MoRA explored four non-parameterized methods for designing
compress and decompress operators:
1. Truncation: Simple, but can result in significant information loss.
2. Row and Column Sharing: Effective for larger ranks (r=128, 256), preserving
more input information.
3. Decomposition (for smaller ranks, r=8): Breaks input vectors into subvectors to
mitigate information loss.
4. Rotation (inspired by RoPE): Integrates rotation operators to boost the
expressive power of the matrix M, capturing nuanced differences between input
segments.
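A minimal sketch of the truncation variant (option 1 above): compress by keeping the first r entries, apply the square matrix M, and decompress by zero-padding back to the original dimension (illustrative only; the paper prefers the decomposition/rotation variants for small ranks):

```python
import numpy as np

d, r = 4096, 256
rng = np.random.default_rng(0)

M = rng.normal(scale=0.01, size=(r, r))   # trainable square matrix (r*r = 65,536 parameters)
x = rng.normal(size=(d,))

def f_comp(x, r=r):                       # compress: truncate to the first r entries
    return x[:r]

def f_decomp(y, d=d):                     # decompress: zero-pad back to dimension d
    out = np.zeros(d)
    out[: y.shape[0]] = y
    return out

delta_Wx = f_decomp(M @ f_comp(x))        # high-rank update dW x without a d x d matrix
print(delta_Wx.shape, M.size)             # (4096,) 65536
```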
Performance of MoRA
• In MoRA's evaluation experiments, the authors first focused on memorizing UUID pairs,
comparing MoRA with LoRA and FFT. Using ranks 8 and 256, MoRA demonstrated significant
improvements over LoRA while using the same number of trainable parameters.
• MoRA required fewer training steps to memorize UUID pairs compared to LoRA. At rank
256, MoRA achieved performance similar to FFT, with both methods able to memorize all
pairs within 500 steps.
(Figure: character-level accuracy of memorizing UUID pairs, generating the value of the corresponding key at 300, 500, 700 and 900 training steps.)
Performance of MoRA
• MoRA was evaluated across three fine-tuning tasks:
1. Instruction tuning on Tülu v2 dataset, with zero-shot and five-shot MMLU evaluations.
2. Mathematical reasoning on MetaMath dataset, with GSM8K and MATH evaluations.
3. Continual pretraining on biomedical and financial domains using PubMed abstracts and financial news data.
Performance of MoRA
• In these fine-tuning tasks, MoRA was compared against several methods including FFT, LoRA, LoRA+,
AsyLoRA, ReLoRA, and DoRA. Results showed that MoRA performed comparably to LoRA on
instruction tuning and mathematical reasoning tasks. However, MoRA outperformed LoRA in
continual pretraining for both biomedical and financial domains. Generally, higher ranks (256 vs 8)
improved performance, especially in mathematical reasoning tasks.
Summary of MoRA
• MoRA introduces non-parameter operators to reduce input dimensions and
increase output dimensions for the square matrix, allowing it to be merged
back into the LLM like LoRA. The method is evaluated across five tasks:
instruction tuning, mathematical reasoning, continual pretraining, memory,
and pretraining.
• Results show that MoRA outperforms LoRA on memory-intensive tasks and
achieves comparable performance on other tasks, demonstrating the
effectiveness of high-rank updating. The authors provide a detailed analysis of
their method, including various implementations of the compression and
decompression functions used in MoRA.
SBoRA (LM Po et al., 2024-07)
• SBoRA: Low-Rank Adaptation with Regional Weight Updates
• SBoRA enables regional weight updates and memory-efficient finetuning. The majority of
the finetuned model’s weights remain unchanged from the pre-trained weights.
• This characteristic of SBoRA is reminiscent of the modular organization of the human brain,
which efficiently adapts to new tasks.
SBoRA-FA and SBoRA-FB
• SBoRA (Standard Basis LoRA) adopts a unique approach, utilizing orthogonal
standard basis vectors to construct its projection matrices. These fixed matrices are
designated $\mathbf{A}_{sb}$ for SBoRA-FA (Fixed A) and $\mathbf{B}_{sb}$ for SBoRA-FB (Fixed B).
Standard Orthogonal Basis
• SBoRA leverages standard basis vectors to construct the fixed $\mathbf{A}$ or $\mathbf{B}$ matrix of the LoRA
decomposition.
• Specifically, the shared orthogonal basis is the identity matrix, whose rows (or columns) are the standard basis (one-hot) vectors:
  $\mathbf{I} = \begin{pmatrix}1&0&\cdots&0\\0&1&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{pmatrix}$
• Each standard basis vector $\mathbf{e}_i$ has a single non-zero entry of 1 at index i:
  Row standard basis vector: $\mathbf{e}_i = \begin{pmatrix}0&\dots&0&1&0&\dots&0\end{pmatrix}$
  Column standard basis vector: $\mathbf{e}_i^{T}$
• SBoRA initializes one fixed matrix (either $\mathbf{A}$ or $\mathbf{B}$) using these standard basis vectors $\mathbf{e}_i$, while the
other matrix is initialized with zeros, resulting in two variants: SBoRA-FA (Fixed Matrix A) and SBoRA-FB (Fixed Matrix B).
SBoRA-FA: Regional Weight Update Example
• The finetuned weight $\mathbf{W}'$ of SBoRA-FA can be represented as:
  $\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}_{sb}$
• The update matrix $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}_{sb}$ is very sparse, with most of its columns being zero due to the one-hot
nature of the standard-basis matrix $\mathbf{A}_{sb}$.
• For example, when $r = 2$ and $k = 4$ with $\mathbf{A}_{sb}$ built from two standard basis vectors (say $\mathbf{e}_1$ and $\mathbf{e}_4$), $\Delta\mathbf{W}$ will have only two non-zero columns.
SBoRA: Commonsense Reasoning Performance
Comparison of LLaMA-7B and LLaMA3-8B with different PEFT methods, evaluated on the commonsense reasoning task. The
first row lists the results of GPT-3.5 for reference. We report the accuracy (%) for each of the eight sub-tasks as well as the
average accuracy (%); higher is better for all metrics. The column headers indicate TP for the number of trainable parameters and
r for the rank.
SBoRA: Arithmetic Reasoning Performance
Comparison of LLaMA-7B and LLaMA3-8B with different PEFT methods, evaluated on the arithmetic reasoning task.
The first row lists the results of GPT-3.5 for reference. We report the accuracy for each of the sub-tasks as well as
the average accuracy; higher is better for all metrics. The number of trainable parameters (TP) can be found in each row.
QSBoRA: MMLU Performance
Finetuning of LLaMA-7B/13B and LLaMA3-8B on Alpaca and Flan v2. Performance is measured
on the MMLU benchmark, reporting the 5-shot average accuracy. The training settings and the
number of trainable parameters (TP) are included.
SBoRA: Diffusion Model based Text-to-Image Generation
Qualitative comparison of single-concept SBoRA diffusion-model image generation. Reference images
for each concept are shown in the left column. The LoRA-based methods outperform Custom Diffusion in
terms of fidelity. Furthermore, Orthogonal Adaptation and SBoRA exhibit performance comparable to
Mix-of-Show, while also introducing orthogonal constraints that confer advantages in multi-concept
scenarios.
SBoRA: Diffusion Model based Text-to-Image Generation
Quantitative comparison of SBoRA single-concept tuning for image generation in a diffusion model. Previous
methods have exhibited varying performance across different concepts and metrics. Custom Diffusion, for
instance, proves less effective at preserving image alignment, whereas Mix-of-Show and Orthogonal
Adaptation encounter challenges in maintaining text alignment. In contrast, the proposed method achieves
comparable results, demonstrating a more stable score across all concepts and metrics.
Multi-SBoRA: Diffusion Model based Text-to-Image Generation
Summary of SBoRA
• The SBoRA approach enables regional weight updates, preserving most of the pre-trained
model weights while efficiently adapting to new tasks. This localized learning process draws
parallels with the modular organization of the brain, where distinct cognitive functions are
localized to specific brain regions. This analogy highlights the potential of SBoRA to inspire
AI architectures that mimic the efficiency and adaptability of biological neural systems.
• SBoRA holds immense potential for further development, particularly in multi-task training.
The introduction of Multi-SBoRA creates a powerful framework for efficient
adaptation to multiple tasks. Each task undergoes independent, non-overlapping
weight finetuning with SBoRA, allowing the integration of task-specific knowledge while
minimizing interference and maximizing the model's capacity to leverage shared
information. This approach enables the maintenance of distinct capabilities for each task
within a single model, paving the way for more efficient and effective AI systems.