Parameter Efficient Fine Tuning
https://medium.com/@lmpo/parameter-efficient-fine-tuning-of-large-language-models-4ed51860e1da
Content
• Review of Finetuning Approaches
• Parameter Efficient Finetuning (PEFT)
§ Adapter Layer
§ LoRA (Low Rank Adaptation)
§ QLoRA (Quantized LoRA)
§ DoRA (Optional)
§ MoRA (Optional)
§ SBoRA and Multi-SBoRA (Optional)
Transfer Learning and Finetuning
• Transfer learning and fine-tuning are two key techniques in machine learning that leverage
pre-trained models to improve performance on new tasks.
• While they are often used interchangeably, they have distinct methodologies and applications.
• This approach is based on the idea that a model trained on one task can be adapted to perform
well on another related task.
(Figure: transfer learning of a pretrained model vs. fine-tuning an LLM.)
Finetuning of Convolutional Neural Networks (CNNs)
• In computer vision application with CNNs, finetuning is often applied to pre-trained
models like ResNet and EfficientNet, which were initially trained using supervised
learning on large, labeled datasets.
• There are 3 popular finetuning approaches:
(1) Full Finetuning (2) Feature-based Approach (3) Top-Layer Finetuning
(Figure: the three approaches, (1) update all layers, (2) update the classifier only, (3) update a few of the top layers.)
Finetuning of Large Language Models
• In Natural Language Processing (NLP), finetuning is commonly applied to
pretrained large language models (LLMs) like BERT, GPT-3, which are initially
trained using self-supervised learning on large-scale unlabeled corpora, with
labels generated automatically from the data itself.
LLM Finetuning: From General to Specific
• In the realm of NLP, finetuning of LLMs is a crucial step in transforming
general-purpose pretrained models into specialized models tailored to meet
the unique demands of specific applications.
• This process effectively bridges the gap between the generic, pretrained
models and the nuanced requirements of a particular task or domain.
§ For example, finetuning a pretrained GPT-3 model on a dataset of medical
reports and patient notes enables it to adapt to complex medical terminology
and jargon, significantly enhancing its performance in generating accurate
patient reports.
§ This targeted finetuning unlocks the full potential of LLMs in specialized
applications.
LLM Training Pipelines
• Step 1: Pre-training (Self-Supervised Training)
• Step 2: Finetuning (Supervised Finetuning)
• Step 3: Human Alignment (RLHF with PPO or DPO)
Parameter-Efficient Finetuning
(PEFT)
LLMs are Becoming Very Large Indeed
The size of LLMs has been rapidly increasing, with models like GPT-3 having 175 billion parameters,
and recent models like Google's PaLM surpassing the trillion-parameter mark, enabling more
capable but also more resource-intensive models.
Naively Fine-Tuning LLaMA-3-8B takes 128GB of RAM!
• Fine-tuning small models like LLaMA3 8B on regular consumer GPUs can be
challenging due to the significant memory requirements:
1. Memory Requirements: LLaMA3 8B has 8 billion parameters and if it’s loaded in full-precision
(float32 format-> 4 bytes/parameter), then the total memory requirements for loading the model
would be numberOfParams*bytesPerParam = 8 billion*4 = 32GB of memory.
• Given that many consumer GPUs/ free versions of software like Google Colab have memory constraints
(e.g., NVIDIA T4 16GB on Google Colab), the model cannot even be loaded!
2. Fine-tuning memory requirements: In the case of full fine-tuning with the regular Adam
optimizer using a half-precision model (2 bytes/param), we need to allocate per parameter: 2
bytes for the weight, 2 bytes for the gradient, and 12 bytes for the Adam optimizer states. This
results in a total of 16 bytes per trainable parameter, requiring over 120GB of GPU memory!
• This would require at least 3× A40 GPUs with 48GB of VRAM each, which would mean fine-tuning
wouldn't be accessible to the public.
https://medium.com/polo-club-of-data-science/memory-requirements-for-fine-tuning-llama-2-80f366cba7f5
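As a sanity check on these numbers, here is a small back-of-the-envelope calculation in Python (it simply applies the 16 bytes/trainable-parameter estimate above; exact figures depend on the framework and optimizer):

```python
# Rough memory estimates for LLaMA-3-8B (illustrative only).
n_params = 8e9

# Loading the model in full precision (float32, 4 bytes/parameter).
load_fp32_gb = n_params * 4 / 1e9           # ~32 GB

# Full fine-tuning with Adam on a half-precision model:
# 2 B weight + 2 B gradient + 12 B optimizer states = 16 B per trainable parameter.
full_ft_gb = n_params * (2 + 2 + 12) / 1e9  # ~128 GB

print(f"load fp32: {load_fp32_gb:.0f} GB, full fine-tune: {full_ft_gb:.0f} GB")
```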
Full Fine-Tuning (FFT)
• Full Fine-Tuning updates all pre-trained model parameters to adapt to a
new task, typically leading to improved performance. However, FFT has two main drawbacks:
1. Computational expense: Even fine-tuning small models like LLaMA3 with 8B
parameters can be resource-intensive.
2. Storage costs: Saving entire models for each checkpoint (finetuned model) can also
be storage-prohibitive.
(Figure: FFT needs a separate copy of the full finetuned 8B model for each downstream task.)
Parameter-Efficient Finetuning (PEFT)
• To address these FFT limitations, PEFT methods were developed,
adapting only a small subset of a model's parameters to a new task.
• The importance of PEFT in practical LLM applications lies in its ability to:
1. Lower hardware requirements and reduce memory needs
2. Speed up training times and reduce GPU usage
3. Improve modeling performance by reducing overfitting
4. Minimize storage needs by sharing weights across tasks
Reduce the Number of Parameters by PEFT
• PEFT enables the reuse of pretrained model weights, requiring only a
small number of additional task-specific parameters.
• Example: adapting a LLaMA-3-8B model to four tasks with PEFT requires the shared 8B-parameter base model plus four sets of task-specific parameters of roughly 80M (0.08B) each:
  8B + 4 × 0.08B = 8.32B parameters (≈33.28GB of storage),
  which is only about 26% of the storage needed to keep four fully finetuned 8B models.
PEFT Techniques
1. Adapter Tuning (2019) – Add new intermediate modules
2. Prefix Tuning (2021) – Add additional prefixes
3. Prompt Tuning (2021) – Adapts input prompts
4. LoRA (2021) – Low-Rank decomposition
5. QLoRA (2023) – Quantized LoRA
6. DoRA (2024) – Weight-Decomposed LoRA
7. MoRA (2024) – High-Rank Updating
All tasks share the same original PLM; the adapters are task-specific modules => better robustness, storage-efficient
https://www.youtube.com/watch?v=R3jZVKUlSjA
Adapter Performance on Finetuning BERT
1. Adapter-trained BERT models achieve
similar performance to fully finetuned
ones while training only 3.6% of the
parameters. This suggests significant
parameter efficiency.
2. Adapters outperform top-layer finetuning
using even fewer parameters. This implies
higher efficiency than training just the
output layers of BERT.
Other Types of Adapter
• https://adapterhub.ml/
https://arxiv.org/abs/2210.06175
Low-Rank Adaptation (LoRA)
https://arxiv.org/pdf/2106.09685.pdf
LoRA (Edward Hu et al., 2021-06)
• Low-Rank Adaptation (LoRA) is a groundbreaking PEFT technique for LLMs.
• It introduces a parallel low-rank adapter alongside the weights of linear layers,
reducing memory overhead and computational costs during finetuning.
https://arxiv.org/abs/2106.09685
Motivation of LoRA
• Core Finding: Low Intrinsic Dimensionality in Language Models
• Significance of Intrinsic Dimensionality:
§ Intrinsic dimensionality is a crucial metric that explains why large
language models are efficiently fine-tunable with limited data.
• Broader Impact:
§ Understanding intrinsic dimensionality could lead to more resource-efficient and
effective ways to train and deploy language models.
What is LoRA?
• LoRA – Low-Rank Adaptation
§ Low-rank: the rank $r$ of the matrix is smaller than the matrix dimension $d$
§ Rank: the number of linearly independent rows/columns
§ Adaptation: fine-tuning of models
• Examples with $d = 3$:
  Rank-1 matrix: $\mathrm{rank}\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}=1$
  Rank-2 matrix: $\mathrm{rank}\begin{pmatrix}2&10&1\\7&20&2\\6&30&3\end{pmatrix}=2$
  Rank-3 matrix: $\mathrm{rank}\begin{pmatrix}2&3&1\\7&5&2\\6&1&3\end{pmatrix}=3$
Recap: Matrix Rank
• The rank of a matrix $\mathbf{A} \in \mathbb{R}^{d\times k}$ is the number of linearly independent
columns (or rows), and always satisfies:
  $\mathrm{rank}(\mathbf{A}) \le \min(d, k)$
• A matrix with $\mathrm{rank}(\mathbf{A}) = \min(d, k)$ is called a Full-Rank Matrix. For example, the following
3×3 matrix has a rank of 3, making it a full-rank matrix:
  $\mathrm{rank}\begin{pmatrix}2&3&1\\7&5&2\\6&1&3\end{pmatrix}=3$
• A matrix with $\mathrm{rank}(\mathbf{A}) < \min(d, k)$ is called a Low-Rank Matrix. Here are two examples of
low-rank matrices:
  $\mathrm{rank}\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}=1, \qquad \mathrm{rank}\begin{pmatrix}2&10&1\\7&20&2\\6&30&3\end{pmatrix}=2$
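These ranks can be checked quickly with NumPy (a small illustration, not part of the original slides):

```python
import numpy as np

full_rank = np.array([[2, 3, 1], [7, 5, 2], [6, 1, 3]])
rank1     = np.array([[2, 10, 1], [4, 20, 2], [6, 30, 3]])
rank2     = np.array([[2, 10, 1], [7, 20, 2], [6, 30, 3]])

print(np.linalg.matrix_rank(full_rank))  # 3 (full rank)
print(np.linalg.matrix_rank(rank1))      # 1 (low rank)
print(np.linalg.matrix_rank(rank2))      # 2 (low rank)
```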
Recap: Rank Matrix Decomposition
• Low-rank matrices can be decomposed into low-dimensional matrices.
• A low-rank matrix can be decomposed into the product of two low-dimensional matrices.
For instance, a rank-1 3×3 matrix can be decomposed as follows:
  $\underbrace{\begin{pmatrix}2&10&1\\4&20&2\\6&30&3\end{pmatrix}}_{3\times3} = \underbrace{\begin{pmatrix}1\\2\\3\end{pmatrix}}_{3\times1}\times\underbrace{\begin{pmatrix}2&10&1\end{pmatrix}}_{1\times3}$
§ This decomposition reduces the number of coefficients needed to represent the matrix from 9 to
6 = (3×1 + 1×3).
• In general, a rank-$r$ $n\times n$ matrix can be decomposed into an $n\times r$ matrix and an $r\times n$ matrix.
• For $r = 1$, the reduction in coefficients is $n^2 \Rightarrow 2n$.
• The general reduction in coefficients for a rank-$r$ matrix is $n^2 \Rightarrow 2rn$.
Recap: Matrix Representation of FFN
• Formulation of Hidden Layer 1: $\mathbf{a}^{(1)} = g(\mathbf{z}^{(1)}) = g(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})$
(Figure: a fully-connected layer mapping the input $\mathbf{x} = (x_1, \dots, x_4)^{T}$ to the hidden activations $\mathbf{a}^{(1)}$ through the weight matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{4\times4}$ and bias $\mathbf{b}^{(1)} \in \mathbb{R}^{4}$.)
• Common pre-trained models have been empirically shown to have a very low intrinsic
dimension (low rank).
• In other words, there exists a low-dimensional reparameterization that is as effective for
finetuning as the full parameter space.
https://arxiv.org/pdf/2012.13255.pdf
LoRA (Low-Rank Adaptation)
• Use low-rank submodules to modify the hidden representations:
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
§ Pretrained weights: $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ (frozen)
§ Introduce two new smaller matrices $\mathbf{A} \in \mathbb{R}^{r\times k}$ and $\mathbf{B} \in \mathbb{R}^{d\times r}$
  $\mathbf{h} = (\mathbf{W}_0 + \Delta\mathbf{W})\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$,
  where $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$ is a low-rank matrix.
https://arxiv.org/abs/2106.09685
LoRA (Low-Rank Adaptation)
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
(Figure: the frozen pretrained weights $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ sit in parallel with the LoRA branch; $\mathbf{A} \in \mathbb{R}^{r\times k}$ is initialized from $\mathcal{N}(0, \sigma^2)$ and $\mathbf{B} \in \mathbb{R}^{d\times r}$ is initialized to zero, so $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \mathbf{0}$ at the start of training.)
LoRA: $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$
• Example with $d = k = 4$:
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \Delta\mathbf{W}\mathbf{x}$, where $\mathbf{W}_0 = (w_{ij}) \in \mathbb{R}^{4\times4}$ is the frozen pretrained weight matrix and the low-rank update is, e.g.,
  $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \begin{pmatrix}0.15&-0.14&-0.21&0.612\\-0.22&0.204&0.308&-0.86\\-0.30&-0.16&0.634&0.147\\-0.07&-0.2&0.246&0.523\end{pmatrix}$
LoRA High Parameter Efficiency Example
• Only a small fraction of the weights is trained, giving (a) a lower memory footprint and (b) faster finetuning jobs.
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
§ W.shape == (1024, 1024): W parameter count = 1024² = 1,048,576 (frozen)
§ B.shape == (1024, 8): B parameter count = 1024 × 8 = 8,192
§ A.shape == (8, 1024): A parameter count = 8 × 1024 = 8,192
§ Trainable parameters: 8,192 + 8,192 = 16,384, about 1.6% of the frozen weights.
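A minimal PyTorch sketch of such a LoRA-augmented linear layer (an illustrative implementation, not the official one; the names `LoRALinear`, `rank`, and `alpha` are ours, and the α/r scaling is introduced on a later slide):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)                # pretrained weights W0 (d x k)
        self.W0.weight.requires_grad_(False)                  # frozen
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)    # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d, rank))           # B = 0, so dW = BA = 0 at start
        self.scale = alpha / rank                             # scaling factor alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.W0(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(d=1024, k=1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 = 1024*8 + 8*1024, ~1.6% of the 1,048,576 frozen weights
```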
Applying LoRA to Feed-Forward Networks
• LoRA can be applied to each fully-connected layer, and the classifier heads
are the task-specific modules.
(Figure: a 12-layer Transformer stack on top of the embedding layer, with LoRA branches (A, B) added to the feed-forward down-projection and up-projection of each Transformer layer, and a task-specific classifier head on top.)
Applying LoRA to Transformers
(Figure: LoRA adapters attached to the query (Q), key (K), and value (V) projection matrices of the self-attention blocks.)
Which Weight Matrices?
(Table from the LoRA paper comparing which attention weight matrices to adapt under a fixed parameter budget.)
https://arxiv.org/pdf/2106.09685.pdf
Scaling Factor 𝛼
• The scaling factor $\alpha$ is used to scale the output of the matrices $\mathbf{B}$ and $\mathbf{A}$:
  $\mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \dfrac{\alpha}{r}\,\mathbf{B}\mathbf{A}$
• It is divided by the rank 𝑟, which represents the intrinsic dimension and determines the
level of decomposition or compression applied to the weights.
• Typically, the rank ranges from 1 to 64, while the scaling factor 𝛼 controls the amount of
change applied to the original model weights, striking a balance between the knowledge of
the pre-trained model and its adaptation to a new task.
• Both the 𝛼 and 𝑟 are hyperparameters, which need to be tuned.
§ Basically, the scaling factor 𝛼 helps in stabilizing other hyperparameters, such as learning rates, when
the rank is varied. By adjusting the rank and incorporating the scaling factor, one can explore different
levels of decomposition without needing to extensively tweak other parameters. This approach
simplifies the process of finding the optimal level of decomposition for a given task.
How Low-Rank can LoRA go?
• LoRA works even with extremely small values of r, such as 4, 2, or even 1.
• On the WikiSQL and MultiNLI datasets, the authors found no
statistically significant difference in performance when reducing the rank from r = 64
to r = 1.
https://arxiv.org/pdf/2106.09685.pdf
LoRA Performance Parity with Fully Finetuned LLMs
Extremely Parameter Efficient Finetuning
The number of trainable parameters is less than 1% of the total model size.
LoRA Comparison on GPT-3
LoRA reduces the number of trainable parameters in GPT-3 by 5 orders of magnitude!
https://arxiv.org/pdf/2106.09685.pdf
Hugging Face PEFT
https://huggingface.co/docs/peft/en/index
We don't have to manually apply a low-rank decomposition to each layer individually. Instead, we can
use the `get_peft_model` function, which takes care of this process for us.
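For example, a typical usage sketch with the Hugging Face peft library (the model name and target module names are illustrative; check the library documentation for the exact arguments of your installed version):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which linear layers get LoRA adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)    # wraps the listed layers with LoRA
model.print_trainable_parameters()        # prints trainable vs. total parameter counts
```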
Benefits of LoRA
• Less Parameters: LoRA reduces computational requirements during training,
leading to faster training and lower memory usage.
• Flexibility: Switch between different LoRA weights
• Seamless Integration: The ∆𝐖 (𝐁𝐀) weights from the rank decomposition can
be merged with the original model weights by simply adding them together,
without introducing any overhead during inference.
  During training: $\mathbf{h} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$
  After training (merged weights): $\mathbf{W}' = \mathbf{W}_0 + \mathbf{B}\mathbf{A} \in \mathbb{R}^{d\times k}$, so inference is simply $\mathbf{h} = \mathbf{W}'\mathbf{x}$.
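A small NumPy sketch of this merge step (assuming the α/r scaling from the earlier slide; after merging, inference is a single matrix multiply):

```python
import numpy as np

d, k, r, alpha = 1024, 1024, 8, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                  # frozen pretrained weights
B  = rng.normal(scale=0.01, size=(d, r))      # trained LoRA factors (nonzero after training)
A  = rng.normal(scale=0.01, size=(r, k))

W_merged = W0 + (alpha / r) * (B @ A)         # W' = W0 + (alpha/r) BA

x = rng.normal(size=(k,))
h_two_branch = W0 @ x + (alpha / r) * (B @ (A @ x))
h_merged     = W_merged @ x
print(np.allclose(h_two_branch, h_merged))    # True: no extra inference overhead
```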
LoRA: A New Paradigm Shift in NLP
• LoRA enables us to adapt pretrained LLMs to specific downstream
tasks faster, more robustly, and with orders of magnitude fewer
learnable parameters compared to standard fine-tuning.
• LoRA's success suggests low-rank, coarse-grained weight updates
during fine-tuning, akin to "remembering" over "learning".
• Lowest possible rank depends on downstream task difficulty relative
to pre-training.
• Lower ranks expected in earlier Transformer layers, higher ranks in
later layers.
QLoRA
https://arxiv.org/pdf/2305.14314.pdf
QLoRA: LoRA with 4-bit Quantization (2023-05)
• In LoRA, the pretrained weights $\mathbf{W}_0$ still account for most of the memory footprint.
§ LLaMA-3-70B model with 32-bit precision requires 820GB of GPU memory.
https://arxiv.org/pdf/2305.14314.pdf
Innovations of QLoRA
1. NF4 (4-bit NormalFloat):
• A specialized 4-bit floating-point format that normalizes weight values to the
range [-1, 1] before quantization, allowing for a more accurate representation of
the weight distribution and outperforming other 4-bit quantization techniques.
2. Double Quantization (DQ):
• A nested quantization technique that combines NF4 with further compression of
quantization constants to an 8-bit format, resulting in significant memory savings
(around 3GB for massive models like LLaMA-65B).
3. Paged Optimizers:
• A technique that pages optimizer states between CPU and GPU memory on demand,
preventing out-of-memory spikes during training (described in more detail later).
Block-wise k-bit Quantization
• Quantization is the process of discretizing an input from a representation that holds more
information to a representation with less information.
• It often means taking a data type with more bits and converting it to fewer bits, for example from 32-
bit floats to 8-bit Integers.
• To ensure that the entire range of the low-bit data type is used, the input data type is commonly
rescaled into the target data type range through normalization by the absolute maximum of the
input elements, which are usually structured as a tensor.
• For example, quantizing a 32-bit Floating Point (FP32) tensor into an Int8 tensor with range [-127, 127]:
  $\mathbf{X}^{\text{Int8}} = \mathrm{round}\!\left(\dfrac{127}{\mathrm{absmax}(\mathbf{X}^{\text{FP32}})}\,\mathbf{X}^{\text{FP32}}\right) = \mathrm{round}\!\left(c^{\text{FP32}}\cdot\mathbf{X}^{\text{FP32}}\right)$
  where $c^{\text{FP32}} = \dfrac{127}{\mathrm{absmax}(\mathbf{X}^{\text{FP32}})}$ is the quantization constant (quantization scale).
• Dequantization is the inverse:
  $\mathrm{dequant}\!\left(c^{\text{FP32}}, \mathbf{X}^{\text{Int8}}\right) = \dfrac{\mathbf{X}^{\text{Int8}}}{c^{\text{FP32}}} \approx \mathbf{X}^{\text{FP32}}$
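A minimal NumPy illustration of absmax Int8 quantization and dequantization (per-tensor here; block-wise quantization, discussed below, applies the same idea to blocks of e.g. 64 values, each with its own constant c):

```python
import numpy as np

x_fp32 = np.array([0.1, -0.5, 1.9, -2.4, 0.0], dtype=np.float32)

c = 127.0 / np.abs(x_fp32).max()                  # quantization constant (scale)
x_int8 = np.round(c * x_fp32).astype(np.int8)     # quantize: round(c * X)

x_deq = x_int8.astype(np.float32) / c             # dequantize: X_int8 / c
print(x_int8)   # [   5  -26  101 -127    0]
print(x_deq)    # approximately the original values
```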
QLoRA
• Using the components described above, QLoRA defines a single linear layer in the
quantized base model with a single LoRA adapter as (in simplified notation):
  $\mathbf{h} = \mathrm{dequant}\!\left(c, \mathbf{W}_0^{\text{NF4}}\right)\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x}$
  where the 4-bit base weights $\mathbf{W}_0^{\text{NF4}}$ are dequantized on the fly to the compute data type (e.g. BF16) for the forward pass, while $\mathbf{A}$ and $\mathbf{B}$ are stored and trained in 16-bit precision.
4-bit NormalFloat (NF4)
• According to the QLoRA paper, pre-trained parameters are generally in
accordance with a zero-centered normal distribution with a standard
deviation of σ. We can scale σ to transform all weights into a single fixed
distribution that fully adapts to the data range specified by QLoRA.
• Motivated by this, QLoRA calculates the values of qj based on the
quantiles of the normal distribution.
• The current problem is how to calculate the 16 quantiles:
  $q_1, \dots, q_{16} \in [-1, 1]$
4-bit NormalFloat (NF4)
• NF4 is an information-theoretically optimal data type for normally distributed weights.
https://ai.plainenglish.io/qlora-key-quantization-and-fine-tuning-techniques-in-the-era-of-large-language-models-0fa05a961d27
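A simplified illustration of the quantile idea behind NF4 (this is not the exact NF4 construction from the paper, which additionally splits the range asymmetrically so that zero is represented exactly; the probability offsets chosen here are illustrative):

```python
import numpy as np
from scipy.stats import norm

# 16 evenly spaced probabilities (avoiding 0 and 1), mapped through the
# standard-normal quantile function, then normalized into [-1, 1].
p = np.linspace(0.03, 0.97, 16)      # offset keeps the quantiles finite
q = norm.ppf(p)
q = q / np.abs(q).max()              # 16 levels in [-1, 1], denser near 0 where most weights lie
print(np.round(q, 3))
```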
Double Quantization
• Block-wise Quantization
§ We know that the essence of quantization is to map values from a larger range to
a smaller range. We can use a constant c to proportionally reduce the values. In
this way, we can easily use the same constant c to dequantize the quantized
values back to their original (approximate) form.
§ However, if our data contains outliers, this will affect the selection of c and cause
the other values to collapse into a small range. Block-wise quantization provides a solution:
it quantizes one block at a time, with each block using its own independent
quantization constant c.
• Since quantization constants are typically stored as FP32, the memory usage
can become significant when there are a large number of blocks.
The Approach of QLoRA
• QLoRA divides the parameters into blocks of size 64.
§ Each block calculates a quantization constant, denoted as c.
• QLoRA further quantizes the quantization constants into FP8 using Double
Quant, with a block size of 256.
• This further reduces the memory consumption.
• Before Double Quant:
• Quantizing each parameter requires an additional 32/64 = 0.5 bits of memory.
• After Double Quant:
• Quantizing each parameter only requires an additional 8/64 + 32 / (64*256) = 0.127 bits of
memory.
Double Quantization: Reduce absmax constant size
Paged Optimizer: Prevent Memory Spikes
• Page-by-page transfers of memory from CPU <=> GPU as needed
§ Lazy and does not need to be managed (no offloading, everything is automatic).
• The Paged Optimizer mechanism allows optimizer states to be transferred to CPU memory
when GPU memory runs low.
§ They are loaded back when the optimizer state needs to be updated.
§ This effectively reduces the peak occupancy of GPU memory.
• The QLoRA paper states that this mechanism is necessary to train a model
with 33 billion parameters on a 24GB GPU.
• This mechanism can be easily configured by setting the parameters of TrainingArguments:
§ optim = ‘paged_adamw_32bit’
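A hedged configuration sketch using the Hugging Face transformers + bitsandbytes integration (argument names reflect current library versions and should be verified against your installation; the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,        # double quantization of the constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_cfg
)

args = TrainingArguments(
    output_dir="qlora-out",
    optim="paged_adamw_32bit",             # paged optimizer to absorb memory spikes
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```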
Paged Optimizer: Prevent Memory Spikes
• Paged Optimizer works like this:
1. A large mini-batch (long sequence length) uses more GPU memory than
available
2. The paging engine evicts optimizer states to CPU memory
3. During the optimizer step, all optimizer states are prefetched back to the GPU
4. Do an optimizer step
5. Continue to process everything on the GPU as long as the mini-batch does not
cause an eviction
How does QLoRA reduce memory to 14GB?
• Below is the calculation to determine the memory requirements for fine-tuning LLaMA3–8B with
QLoRA.
§ Memory requirement for loading the 4-bit quantized model:
• The LLaMA3-8B base model has about 8 billion parameters, and each parameter is quantized to 4 bits
(0.5 bytes). Hence, loading the model would take about 4GB ( 8 billion parameters × 0.5 bytes).
§ Memory requirement per trainable parameter consists of:
• Weight: 0.5 bytes
• LoRA parameters: 2 bytes
• AdamW optimizer states: 2 bytes
• Gradients (always in fp32): 4 bytes
• Therefore, the memory per trainable parameter is 8.5 bytes ( ≈ 0.5 + 2 + 2 + 4)
§ Total memory requirement for trainable parameters:
• LoRA typically results in 0.4-0.7% trainable parameters; assume 0.6% trainable parameters here
• The total trainable parameters memory :
Memory per parameter * parameters = 8.5 bytes * 48 million (0.6% of 8B parameters) ≈ 0.408 GB
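The same arithmetic as a small Python check (the 0.6% trainable-parameter figure is the assumption stated above):

```python
n_params = 8e9

base_4bit_gb = n_params * 0.5 / 1e9                   # 4-bit weights: ~4 GB

trainable = 0.006 * n_params                           # ~48 million LoRA parameters
bytes_per_trainable = 0.5 + 2 + 2 + 4                  # weight + LoRA + optimizer states + gradient
trainable_gb = trainable * bytes_per_trainable / 1e9   # ~0.41 GB

print(f"base: {base_4bit_gb:.1f} GB, trainable: {trainable_gb:.2f} GB")
```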
How does QLoRA reduce memory to 14GB?
• Total memory requirement for LLaMA3-8B QLoRA Training: The total memory requirement for
QLoRA training is around 4.1GB, which includes the memory for the base model and the memory for
trainable parameters ≈ 0.408 GB, resulting in a total training memory requirement of about ≈ 4–5
GB (depending on the number of trainable parameters).
• Memory required for inference: If we load the base model in 16-bit precision and merge the LoRA
weights of the fine-tuned model, we would use at most 14 GB of GPU memory for a sequence
length of 2048. This memory cost is derived from loading the model in float16 precision and includes
activations, temporary variables and hidden states, which are always in full-precision (float32) format
and depend on many factors including sequence length, hidden size and batch size.
• Total memory requirements: So, the total memory requirement for QLoRA training with a 4-bit base
model in mixed-precision mode, including loading the 16-bit model for inference, would be
almost ≈ 14 GB, depending on the sequence length.
• Thus, we can see that using quantization techniques like QLoRA along with PEFT can significantly
reduce memory requirements by up to 90%, thereby making fine tuning more accessible and
affordable!
Large Models Are Not Easily Accessible
https://www.youtube.com/watch?v=fQirE9N5q_Y
QLoRA
• QLoRA hyperparameter settings:
§ Alpha determines the multiplier applied to the weight changes when they are added to the original
weights
• Scale multiplier = Alpha / Rank
• The Microsoft LoRA repository sets alpha to 2 × Rank
• The QLoRA paper went with ¼ of the Rank (alpha = 16, r = 64)
§ Dropout is a percentage that randomly leaves out some weight changes each time to deter
overfitting
• The QLoRA paper went with 0.1 for 7B-13B models and 0.05 for 33B-65B models
• The QLoRA paper has two interesting findings:
§ Training all layers of the network is necessary to match the performance of full-parameter fine-tuning
§ The rank may not matter much from 8 to 256
https://www.youtube.com/watch?v=t1caDsMzWBk
QLoRA Summary
• QLoRA uses NF4, double quantization, and paged optimizers combined
with LoRA to replicate 16-bit full finetuning performance at a 17x
smaller memory footprint.
• While evaluation is noisy, Guanaco models outperform existing open-
source models on the Vicuna benchmark.
Optional Content
§ DoRA (Optional)
§ MoRA (Optional)
§ SBoRA (Optional)
§ Multi-SBoRA (Optional)
https://www.youtube.com/watch?v=WLDehSkSIhY
DoRA (Wang et al., 2024-02)
Weight-Decomposed Low-Rank Adaptation
• In DoRA, the original pretrained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ is decomposed into a
Magnitude vector $\mathbf{m}$ and a Direction matrix $\mathbf{V}$ prior to applying LoRA.
§ $\mathbf{m} \in \mathbb{R}^{1\times k}$ is the magnitude vector (trainable), initialized to the column-wise norm $\|\mathbf{W}_0\|_c$.
§ The direction matrix is initialized as $\mathbf{V}_0 = \mathbf{W}_0$ (frozen).
• The magnitude vector $\mathbf{m}$ is small and is trained normally.
• The direction matrix $\mathbf{V}$ is big, so LoRA ($\Delta\mathbf{V} = \mathbf{B}\mathbf{A}$, with frozen $\mathbf{V}_0$) is used to fine-tune it.
• After finetuning, the magnitude and the updated direction are merged back into a single weight matrix $\mathbf{W} \in \mathbb{R}^{d\times k}$.
DoRA Methodology
• DoRA optimizes both the magnitudes and directions of the pre-trained weights. Since the directional
component is large in terms of the number of parameters, it is further decomposed with LoRA:
  $\mathbf{W} = \mathbf{m}\,\dfrac{\mathbf{V}}{\|\mathbf{V}\|_c} = \mathbf{m}\,\dfrac{\mathbf{V}_0 + \Delta\mathbf{V}}{\|\mathbf{V}_0 + \Delta\mathbf{V}\|_c} = \mathbf{m}\,\dfrac{\mathbf{W}_0 + \mathbf{B}\mathbf{A}}{\|\mathbf{W}_0 + \mathbf{B}\mathbf{A}\|_c}$
§ $\|\mathbf{V}_0 + \Delta\mathbf{V}\|_c$ is just a normalization term that DoRA treats as a constant; it does not receive gradients
during backpropagation.
• LoRA is applied to the transformer's query and value matrices, and the magnitude and directional
differences between the original and finetuned weight matrices are calculated.
• For inference, the magnitude vectors and updated direction matrices can be combined back into
updated weights for the original model.
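A compact NumPy sketch of this decomposition (column-wise norms; a simplified illustration of the formula above, not the authors' code):

```python
import numpy as np

d, k, r = 16, 32, 4
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                            # frozen pretrained weights
m  = np.linalg.norm(W0, axis=0, keepdims=True)          # magnitude vector (1 x k), trainable
B  = np.zeros((d, r))                                   # LoRA factors for the direction update
A  = rng.normal(scale=0.01, size=(r, k))

V = W0 + B @ A                                          # updated direction V0 + dV
W = m * (V / np.linalg.norm(V, axis=0, keepdims=True))  # W = m * V / ||V||_c
print(W.shape)  # (16, 32): merged weight for inference; equals W0 while B = 0
```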
Analysis of Parameter Update Correlations
• To evaluate the parameter update correlations of full finetuning (FT), LoRA and
DoRA, the authors calculated, at four training-step checkpoints t:
§ Magnitude change: $\Delta\mathbf{M}^{t} = \dfrac{\sum_{i=1}^{k}\left|\mathbf{m}^{t}_{i} - \mathbf{m}^{0}_{i}\right|}{k}$
§ Directional change (via cosine similarity): $\Delta\mathbf{D}^{t} = \dfrac{\sum_{i=1}^{k}\left(1 - \cos(\mathbf{V}^{t}_{i}, \mathbf{W}^{0}_{i})\right)}{k}$
  where t is the number of training steps and i indexes the columns of the weight matrix.
Parameter Update Correlations: FFT vs LoRA
• DoRA's authors found that, for full finetuning (FFT), the changes to magnitude ∆𝐌 and direction ∆𝐃
are largely independent (a small negative correlation), and they spread with a high variance.
• LoRA, however, produces highly positively correlated changes (a positive slope) with a
significantly lower variance, which hurts performance:
§ LoRA cannot make slight directional changes alongside significant magnitude alterations.
(Figure: scatter plots of ∆𝐃 vs. ∆𝐌 at the four checkpoints for FFT and LoRA.)
MoRA: High-Rank Updating for PEFT
• MoRA addresses LoRA's limitations in knowledge enhancement, demonstrating a
superior ability to memorize new information.
MoRA: High-Rank Updating for PEFT
• MoRA's innovation is its use of a compress function to project input vectors into a
lower-dimensional space, perform high-rank transformations using a smaller square matrix,
and then apply a decompress function to project the result back into the original
higher-dimensional space.
(Figure: LoRA (r = 8) with frozen $\mathbf{W}_0 \in \mathbb{R}^{d\times k}$ and trainable $\mathbf{B}$, $\mathbf{A}$, next to MoRA (r = 256) with frozen $\mathbf{W}_0$ and a trainable square matrix $\mathbf{M} \in \mathbb{R}^{r\times r}$ placed between the compress and decompress operators; both map $\mathbf{x} \in \mathbb{R}^{k\times 1}$ to $\mathbf{h} \in \mathbb{R}^{d\times 1}$.)
MoRA: High-Rank Updating for PEFT
• MoRA uses compress and decompress functions to project input vectors into a
lower-dimensional space, perform high-rank transformations with a smaller square matrix $\mathbf{M}$,
and then project back to the original space:
  $\Delta\mathbf{W}\mathbf{x} = f_{\text{decomp}}\!\left(\mathbf{M}\, f_{\text{comp}}(\mathbf{x})\right)$
  $\mathbf{h} = \mathbf{W}_0\mathbf{x} + f_{\text{decomp}}\!\left(\mathbf{M}\, f_{\text{comp}}(\mathbf{x})\right)$
• This approach, using non-parameterized operators, enables high-rank updates without
increasing the number of trainable parameters, differing from LoRA.
Parameter Efficiency of MoRA
• For example, it is given that the weight layer is 4096 x 4096, which means 16,777,216
parameters would need to be updated with FFT.
• If r = 8 is chosen with LoRA, it would result in 2 x (4096 x 8) = 65,536 parameters being updated.
• With MoRA, if a 256 x 256 matrix is chosen for M, it would mean 256 x 256 = 65,536 parameters
would need to be tuned, the same number as LoRA.
Non-Parameterized Compress and Decompress
• MoRA explored four non-parameterized methods for designing
compress and decompress operators:
1. Truncation: Simple, but can result in significant information loss.
2. Row and Column Sharing: Effective for larger ranks (r=128, 256), preserving
more input information.
3. Decomposition (for smaller ranks, r=8): Breaks input vectors into subvectors to
mitigate information loss.
4. Rotation (inspired by RoPE): Integrates rotation operators to boost the
expressive power of the matrix M, capturing nuanced differences between input
segments.
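A minimal sketch of the truncation variant (option 1 above): compress by keeping the first r entries, apply the square matrix M, and decompress by zero-padding back to the original dimension (illustrative only; the paper prefers the decomposition/rotation variants for small ranks):

```python
import numpy as np

d, r = 4096, 256
rng = np.random.default_rng(0)

M = rng.normal(scale=0.01, size=(r, r))   # trainable square matrix (r*r = 65,536 parameters)
x = rng.normal(size=(d,))

def f_comp(x, r=r):                       # compress: truncate to the first r entries
    return x[:r]

def f_decomp(y, d=d):                     # decompress: zero-pad back to dimension d
    out = np.zeros(d)
    out[: y.shape[0]] = y
    return out

delta_Wx = f_decomp(M @ f_comp(x))        # high-rank update dW x without a d x d matrix
print(delta_Wx.shape, M.size)             # (4096,) 65536
```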
Performance of MoRA
• In MoRA's evaluation experiments, the authors first focused on memorizing UUID pairs,
comparing MoRA with LoRA and FFT. Using ranks 8 and 256, MoRA demonstrated significant
improvements over LoRA while using the same number of trainable parameters.
• MoRA required fewer training steps to memorize UUID pairs compared to LoRA. At rank
256, MoRA achieved performance similar to FFT, with both methods able to memorize all
pairs within 500 steps.
(Figure: character-level accuracy of memorizing UUID pairs, generating the value of the corresponding key at 300, 500, 700 and 900 training steps.)
Performance of MoRA
• MoRA was evaluated across three fine-tuning tasks:
1. Instruction tuning on Tülu v2 dataset, with zero-shot and five-shot MMLU evaluations.
2. Mathematical reasoning on MetaMath dataset, with GSM8K and MATH evaluations.
3. Continual pretraining on biomedical and financial domains using PubMed abstracts and financial news data.
Performance of MoRA
• In these fine-tuning tasks, MoRA was compared against several methods including FFT, LoRA, LoRA+,
AsyLoRA, ReLoRA, and DoRA. Results showed that MoRA performed comparably to LoRA on
instruction tuning and mathematical reasoning tasks. However, MoRA outperformed LoRA in
continual pretraining for both biomedical and financial domains. Generally, higher ranks (256 vs 8)
improved performance, especially in mathematical reasoning tasks.
Summary of MoRA
• MoRA introduces non-parameter operators to reduce input dimensions and
increase output dimensions for the square matrix, allowing it to be merged
back into the LLM like LoRA. The method is evaluated across five tasks:
instruction tuning, mathematical reasoning, continual pretraining, memory,
and pretraining.
• Results show that MoRA outperforms LoRA on memory-intensive tasks and
achieves comparable performance on other tasks, demonstrating the
effectiveness of high-rank updating. The authors provide a detailed analysis of
their method, including various implementations of the compression and
decompression functions used in MoRA.
SBoRA (LM Po et al., 2024-07)
• SBoRA: Low-Rank Adaptation with Regional Weight Updates
• SBoRA enables regional weight updates and memory-efficient finetuning. The majority of
the finetuned model’s weights remain unchanged from the pre-trained weights.
• This characteristic of SBoRA is reminiscent of the modular organization of the human brain,
which efficiently adapts to new tasks.
SBoRA-FA and SBoRA-FB
• SBoRA (Standard Basis LoRA) adopts a unique approach, utilizing orthogonal
standard basis vectors to construct its projection matrices. These fixed matrices are
designated $\mathbf{A}_{sb}$ for SBoRA-FA (Fixed A) and $\mathbf{B}_{sb}$ for SBoRA-FB (Fixed B).
Standard Orthogonal Basis
• SBoRA leverages standard basis vectors to construct the fixed $\mathbf{A}$ or $\mathbf{B}$ matrix of the LoRA
decomposition.
• Specifically, the shared orthogonal basis is the identity matrix, whose rows (or columns) are the standard basis (one-hot) vectors:
  $\mathbf{I} = \begin{pmatrix}1&0&\cdots&0\\0&1&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{pmatrix}$
• Each standard basis vector $\mathbf{e}_i$ has a single non-zero entry of 1 at index i:
  Row standard basis vector: $\mathbf{e}_i = \begin{pmatrix}0&\dots&0&1&0&\dots&0\end{pmatrix}$
  Column standard basis vector: $\mathbf{e}_i^{T}$
• SBoRA initializes one fixed matrix (either $\mathbf{A}$ or $\mathbf{B}$) using these standard basis vectors $\mathbf{e}_i$, while the
other matrix is initialized with zeros, resulting in two variants: SBoRA-FA (Fixed Matrix A) and SBoRA-FB (Fixed Matrix B).
SBoRA-FA: Regional Weight Update Example
• The finetuned weight $\mathbf{W}'$ of SBoRA-FA can be represented as:
  $\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}_{sb}$
• The update matrix $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}_{sb}$ is very sparse, with most of its columns being zero due to the one-hot
nature of the standard-basis matrix $\mathbf{A}_{sb}$.
• For example, when $r = 2$ and $k = 4$ with $\mathbf{A}_{sb}$ built from two standard basis vectors (say $\mathbf{e}_1$ and $\mathbf{e}_4$), $\Delta\mathbf{W}$ will have only two non-zero columns.
SBoRA: Commonsense Reasoning Performance
Comparison of LLaMA-7B and LLaMA3-8B with different PEFT methods, evaluated on the commonsense reasoning task. The
first row lists the results of GPT-3.5 for reference. We report the accuracy (%) for each of the eight sub-tasks as well as the
average accuracy (%); higher is better for all metrics. The column headers indicate TP for the number of trainable parameters and
r for the rank.
SBoRA: Arithmetic Reasoning Performance
Comparison of LLaMA-7B and LLaMA3-8B with different PEFT methods, evaluated on the arithmetic reasoning task.
The first row lists the results of GPT-3.5 for reference. We report the accuracy for each of the sub-tasks as well as
the average accuracy; higher is better for all metrics. The number of trainable parameters (TP) can be found in each row.
QSBoRA: MMLU Performance
Finetuning of LLaMA-7B/13B and LLaMA3-8B on Alpaca and Flan v2. Performance is measured
on the MMLU benchmark, reporting the 5-shot average accuracy. The training settings and the
number of trainable parameters (TP) are included.
SBoRA: Diffusion Model based Text-to-Image Generation
Qualitative comparison of single-concept SBoRA diffusion-model image generation. Reference images
for each concept are shown in the left column. The LoRA-based methods outperform Custom Diffusion in
terms of fidelity. Furthermore, Orthogonal Adaptation and SBoRA exhibit performance comparable to
Mix-of-Show, while also introducing orthogonal constraints that confer advantages in multi-concept
scenarios.
SBoRA: Diffusion Model based Text-to-Image Generation
Quantitative comparison of SBoRA single-concept tuning for image generation in a diffusion model. Previous
methods have exhibited varying performance across different concepts and metrics. Custom Diffusion, for
instance, proves less effective at preserving image alignment, whereas Mix-of-Show and Orthogonal
Adaptation encounter challenges in maintaining text alignment. In contrast, the proposed method achieves
comparable results, demonstrating a more stable score across all concepts and metrics.
Multi-SBoRA: Diffusion Model based Text-to-Image Generation
Summary of SBoRA
• The SBoRA approach enables regional weight updates, preserving most of the pre-trained
model weights while efficiently adapting to new tasks. This localized learning process draws
parallels with the modular organization of the brain, where distinct cognitive functions are
localized to specific brain regions. This analogy highlights the potential of SBoRA to inspire
AI architectures that mimic the efficiency and adaptability of biological neural systems.
• SBoRA holds immense potential for further development, particularly in multi-task training.
The introduction of Multi-SBoRA creates a powerful framework for efficient
adaptation to multiple tasks. Each task undergoes independent, non-overlapping
weight finetuning with SBoRA, allowing the integration of task-specific knowledge while
minimizing interference and maximizing the model's capacity to leverage shared
information. This approach enables the maintenance of distinct capabilities for each task
within a single model, paving the way for more efficient and effective AI systems.