QUQ: Quadruplet Uniform Quantization for Efficient Vision Transformer Inference
Xinkuang Geng1, Siting Liu2, Leibo Liu3, Jie Han4, Honglan Jiang1
1 Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai, China
2 School of Information Science and Technology, ShanghaiTech University, Shanghai, China
3 School of Integrated Circuits, Tsinghua University, Beijing, China
4 Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada

xinkuang@sjtu.edu.cn, liust@shanghaitech.edu.cn, liulb@tsinghua.edu.cn, jhan8@ualberta.ca, honglan@sjtu.edu.cn

DAC '24, June 23-27, 2024, San Francisco, CA, USA. ACM ISBN 979-8-4007-0601-1/24/06. https://doi.org/10.1145/3649329.3656516

ABSTRACT

While exhibiting superior performance in many tasks, vision transformers (ViTs) face challenges in quantization. Some existing low-bit-width quantization techniques cannot effectively cover the whole inference process of ViTs, leading to an additional memory overhead (22.3%-172.6%) compared with corresponding fully quantized models. To address this issue, we propose quadruplet uniform quantization (QUQ) to deal with data of various distributions in ViT. QUQ divides the entire data range into at most four subranges that are uniformly quantized with different scale factors. To determine the partition scheme and quantization parameters, an efficient relaxation algorithm is proposed accordingly. Moreover, dedicated encoding and decoding strategies are devised to facilitate the design of an efficient accelerator. Experimental results show that QUQ surpasses state-of-the-art quantization techniques; it is the first viable scheme that can fully quantize ViTs to 6-bit with acceptable accuracy. Compared with conventional uniform quantization, QUQ leads to not only a higher accuracy but also an accelerator with lower area and power.

1 INTRODUCTION

With the development of deep learning techniques, neural networks (NNs) of an increasingly large scale have been applied to a wide range of domains. However, severe challenges arise when deploying these models onto edge devices with limited storage and computational capacity. Making on-device inference feasible for large models, quantization effectively compresses the models and reduces resource requirements without modifying the NN structure. Concerning data privacy and retraining costs, post-training quantization (PTQ) is receiving increasing attention.

Taking advantage of the attention mechanism [11], vision transformers (ViTs) achieve superior performance in various image processing tasks [3]. However, compared to convolutional neural networks (CNNs), the diverse computing modes in ViT introduce data with significantly different distribution characteristics, leading to great challenges in quantization.

When quantizing a ViT, many existing works [2, 12] focus only on the input of compute-intensive operations that can be implemented by general matrix multiplication (GEMM). However, other activations that are difficult to quantize uniformly remain untouched, which results in a significant resource overhead for inference, including extra floating-point operations and increased memory. Additionally, some works adopt special encoding for specific data in ViT [4, 7, 12], which requires diverse computing hardware; thus, they are infeasible for hardware with a single type of arithmetic unit or a common architecture. To support these quantization strategies, additional hardware is necessary.

To guarantee an efficient hardware implementation, ViT is expected to be fully quantized. Thus, a quantization technique that can adapt to diverse data distribution characteristics is needed. We observe that data in ViT shows certain common traits: ① most elements cluster around zero, and outliers exhibit a wide range; ② positive and negative data show different distribution characteristics. Thus, we propose quadruplet uniform quantization (QUQ) to leverage these traits. Specifically, the entire data range is divided into at most four subranges based on the proposed progressive relaxation algorithm. The data belonging to each subrange are then uniformly quantized with a particular scale factor. Also, for certain data, QUQ allows the merging of encoding spaces between subranges with different signs, enabling dynamic adjustments to different data distribution characteristics.

Our major contributions are summarized as follows.
• We characterize the data distribution in ViT and propose QUQ accordingly to enable full quantization.
• A progressive relaxation algorithm is introduced to determine the quantization parameters of QUQ.
• The quadruplet uniform byte (QUB) is proposed to facilitate the encoding and decoding of QUQ, which is then utilized for the design of a QUQ-compatible accelerator.
• The performance of QUQ is evaluated in three ViT models for image classification. Experimental results show that QUQ results in higher accuracy than state-of-the-art quantization methods.
• Compared with the accelerator for uniform quantization, our design demonstrates superior efficiency in area and power while maintaining accuracy at a lower bit-width.

2 BACKGROUND

As shown in Figure 1, a typical ViT mainly consists of cascaded multi-head self-attention (MSA) modules and multi-layer perceptron (MLP) modules. In addition, layer normalization (LayerNorm) and residual connection (element-wise addition) are commonly performed before and after each module, respectively.

Figure 1: Data flow of a partially quantized ViT block.
In general, GEMM operations (Linear and MatMul) are quantized (shown in the green components in Figure 1), ensuring that most computations operate on low-bit-width integers [2, 12]. However, the inputs of residual connection, LayerNorm, Softmax, and GELU are not effectively quantized (shown in the red components); thus, high-bit-width activations exist in the data flow. This necessitates high-precision computation units and large intermediate storage for hardware deployment.

We simulate and count the required sizes of the on-chip memory for the ViT blocks (shown in Figure 1) of different scales during inference. In this simulation, we assume that only the weights required for the current operations are loaded during inference, as it is impractical to load the entire model into on-chip memory in edge devices. Additionally, considering the dynamic generation and usage of activations, it is assumed that they are always stored on-chip to avoid extra accesses to off-chip memory.

Figure 2 illustrates that, compared to the partially quantized ViT models (PQ), the fully quantized models (FQ) exhibit much lower peak memory demands. The advantage becomes more evident in small models, which is of particular concern for edge devices. Moreover, increasing the batch size, which enables an improved throughput, further enhances the superiority of the full quantization method. This occurs because a higher ratio of activations would occur when a larger batch size is utilized.

Figure 2: Peak memory usage in ViT blocks.

However, a low quantization bit-width leads to a significant accuracy drop in fully quantized ViT models. Therefore, to obtain the hardware benefits from full quantization, a technique that can effectively quantize data with various characteristics in ViT is urgently needed.

3 QUADRUPLET UNIFORM QUANTIZATION

3.1 Preliminaries

As the most commonly used data discretization method, symmetric uniform quantization [9] can be expressed as

x̂ = U_b(x; Δ) = clip(⌊x/Δ⌉; −2^(b−1), 2^(b−1) − 1),   (1)

where b is the quantization bit-width and Δ represents the interval between adjacent discrete points, also known as the scale factor. ⌊·⌉ denotes the nearest rounding operation and clip(·) constrains the quantized data within the range of a b-bit integer. A smaller Δ results in a higher quantization resolution, but it causes more outliers to be clipped.

After quantization, a dot product y_k = Σ x_i w_i that is commonly performed in GEMM can be approximated as

ŷ_k ≈ (1/Δ_y) Σ Δ_x x̂_i · Δ_w ŵ_i = (Δ_x Δ_w / Δ_y) Σ x̂_i ŵ_i ≈ (M / 2^N) Σ x̂_i ŵ_i,   (2)

which indicates that all multiplications occur exclusively between the quantized values, as the scale factor Δ_x or Δ_w is shared by every element in the corresponding matrix. In some integer-only implementations [5, 6], the floating-point operations for the scale of an accumulation result are often replaced with multiplication and shift of integers, i.e., Σ x̂_i ŵ_i is multiplied by an integer M and then shifted right by N bits.
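For illustration, the quantizer in (1) and the integer-only dot product in (2) can be sketched in a few lines of NumPy; the function and variable names below are illustrative and not taken from any released code.

import numpy as np

def uniform_quantize(x, delta, b):
    # Symmetric uniform quantization, Eq. (1): round to the nearest grid
    # point and clip to the signed b-bit integer range.
    q = np.round(x / delta)
    return np.clip(q, -2 ** (b - 1), 2 ** (b - 1) - 1).astype(np.int32)

# Toy dot product following Eq. (2): the multiplications involve only the
# quantized integers; the shared scale factors are applied once at the end.
rng = np.random.default_rng(0)
x, w = rng.normal(size=128), rng.normal(size=128)
dx, dw, b = 0.05, 0.02, 8
xq, wq = uniform_quantize(x, dx, b), uniform_quantize(w, dw, b)
y_hat = dx * dw * float(xq @ wq)   # approximates x @ w with integer-only accumulation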
3.2 Quantization Scheme

Figure 3 shows the distributions of the weights for the query matrix in MSA, and the activations after Softmax, before element-wise addition, and after GELU, respectively. As per Figure 3, two observations are obtained. ① Data in ViT commonly follows a long-tailed distribution, i.e., most elements concentrate around zero and outliers lie within a broad range. In this case, determining an appropriate scale factor becomes challenging. A large Δ results in sparse resolution around zero, while a small Δ clips too many outliers to small values. ② Some data are not symmetrically distributed in the positive and negative parts, e.g., the output of GELU shows significant differences between the positive and negative parts, and the output of Softmax contains only non-negative values. Consequently, some discrete points of symmetric uniform quantization may not represent any elements, resulting in a waste of encoding space. To sum up, symmetric uniform quantization is not suitable for the quantization of ViT due to its various data distributions.

Figure 3: The distributions of weights and activations from different modules in ViT and the corresponding quantization points (vertical lines) generated by QUQ. Panels: (a) query weights; (b) post-Softmax activations; (c) pre-addition activations; (d) post-GELU activations.

To effectively discretize these diverse data, we propose quadruplet uniform quantization (QUQ). For the cases where outliers appear on both sides of zero, as shown in Figures 3 (a) and (c), QUQ first divides the entire quantization range into the negative part R_C− and the positive part R_C+ that are assigned coarse quantization intervals (scale factors), ensuring that outliers are not clipped. Subsequently, considering that most elements concentrate around zero, we further isolate two smaller ranges, R_F− and R_F+, also with zero as the boundary and being assigned fine intervals to reduce quantization error. Finally, uniform quantization is applied to each subrange based on its scale factor. Therefore, QUQ requires up to four different scale factors, namely Δ_F−, Δ_F+, Δ_C−, and Δ_C+, as shown in Mode A in Figure 4. b-bit QUQ can be implemented by four symmetric uniform quantizers as

x̂ = Q_b(x; Δ_F−, Δ_F+, Δ_C−, Δ_C+) =
    U_(b−1)(x; Δ_C−),  x ∈ R_C− − R_F−
    U_(b−1)(x; Δ_F−),  x ∈ R_F−
    U_(b−1)(x; Δ_F+),  x ∈ R_F+
    U_(b−1)(x; Δ_C+),  x ∈ R_C+ − R_F+   (3)

Since each symmetric uniform quantizer U_(b−1) receives inputs from only one side of zero, it produces up to 2^(b−2) different quantization results. This means that we assign a quarter of the encoding space to each subrange, e.g., for x ∈ R_F+, x̂ is a (b−2)-bit unsigned integer.
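As a concrete (and simplified) reading of (3), the following NumPy sketch quantizes a tensor with four scale factors under Mode A. The fine-subrange boundaries t_n and t_p, the convention that zero joins the positive side, and all names are illustrative assumptions rather than the authors' implementation.

import numpy as np

def quq_quantize(x, d_fn, d_fp, d_cn, d_cp, t_n, t_p, b):
    # Quadruplet uniform quantization, Eq. (3), Mode A (no merged subranges).
    # t_n and t_p are the magnitudes of the fine-subrange boundaries, so
    # R_F- covers [-t_n, 0) and R_F+ covers [0, t_p]; values beyond them fall
    # into the coarse subranges R_C- and R_C+. Each subrange uses a (b-1)-bit
    # symmetric quantizer applied to one side of zero only.
    def u(v, delta):
        return np.clip(np.round(v / delta), -(2 ** (b - 2)), 2 ** (b - 2) - 1)

    fine_n = (x < 0) & (x >= -t_n)
    fine_p = (x >= 0) & (x <= t_p)
    coarse_n = x < -t_n
    coarse_p = x > t_p
    xq = np.zeros_like(x)
    scale = np.zeros_like(x)
    for mask, delta in ((fine_n, d_fn), (fine_p, d_fp),
                        (coarse_n, d_cn), (coarse_p, d_cp)):
        xq[mask] = u(x[mask], delta)
        scale[mask] = delta
    return xq, scale   # dequantization: xq * scale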
As shown in Figure 4, the overlap of different quantization points may occur because a coarse range may include a fine range. Although this results in potential encoding inefficiency, it is important that each subrange is bounded by zero. This ensures that the encoding of each quantization point is proportional to the original value, eliminating the need to store and process additional zero points.

Figure 4: Quantization points of QUQ and the corresponding FC Registers in four different modes.

So far, a dot product cannot be implemented solely through integer multiplications between quantized values as shown in (2), because the scale factor of each element can take four different values depending on the subrange it falls into. To reduce the hardware complexity, we enforce the relationship among the four scale factors as

Δ_F−/s_F− = Δ_F+/s_F+ = Δ_C−/s_C− = Δ_C+/s_C+ = Δ,  s = 2^0, 2^1, 2^2, ···   (4)

Consequently, when performing a dot product, the shared Δ for the same vector can be extracted. In this way, the multiplication occurs only between quantized results scaled by integer powers of two, which can be simplified into a shift operation on the product as

ŷ_k = (1/(s_yk Δ_y)) Σ s_xi Δ_x x̂_i · s_wi Δ_w ŵ_i = (1/s_yk) (Δ_x Δ_w / Δ_y) Σ x̂_i ŵ_i ≪ (log2 s_xi + log2 s_wi).   (5)
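Equation (5) is what makes QUQ hardware-friendly: once every per-element scale factor is a power-of-two multiple of a shared Δ, the extra factor becomes a shift of the product. A minimal sketch, assuming the quantized values and their exponents n_sh = log2 s are NumPy integer arrays (names illustrative):

import numpy as np

def quq_dot(xq, x_nsh, wq, w_nsh, dx, dw):
    # Eq. (5): each product of quantized values is shifted left by the sum of
    # the two per-element exponents; the shared base scale factors dx and dw
    # are applied once after accumulation. The final requantization by the
    # output scale (also a shift) is omitted here.
    prods = xq.astype(np.int64) * wq.astype(np.int64)
    acc = int(np.sum(prods << (x_nsh + w_nsh)))
    return dx * dw * acc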
As discussed above, QUQ divides the entire range into four subranges, with each subrange being assigned a quarter of the encoding space. To better accommodate the diverse data distributions in ViT, QUQ enables the merging of two subranges with the same granularity. For instance, as shown in Figure 3 (d), as the negative part does not have outliers, R_C− becomes unnecessary, allowing the corresponding encoding space to be utilized for finer quantization of R_C+. Following this idea, the quantization points of QUQ can differentiate into several modes, as shown in Figure 4.

In Mode A, as the general form of QUQ, no merging occurs. In Mode B, merging occurs at both the fine and coarse quantization subranges. Mode C is designed for data without outliers on either side of zero, where the encoding spaces for the coarse subranges are merged. In Mode D, the fine and coarse subranges are merged separately, and their encoding spaces are assigned to the different sides of zero. As a result, both the positive and negative parts degenerate into uniform quantization.

Furthermore, setting Δ_C− = Δ_F+ in Mode D generates uniform quantization points throughout the entire range. Thus, symmetric uniform quantization can be considered as a special case of QUQ. This indicates that, with appropriate quantization settings, the performance of QUQ for any type of data will not be inferior to that of symmetric uniform quantization.

3.3 Progressive Relaxation Algorithm

As the quantization target of QUQ contains multiple modes, a customized strategy is needed to determine the quantization parameters based on the calibration data and ensure that the scale factors satisfy (4). To this end, we propose a progressive relaxation algorithm, which is formulated based on the following two guiding principles.

① The ratio between the coarse and fine quantization subranges should be as large as possible, which reduces the encoding space wastage caused by the overlap of subranges.

② The fine quantization subrange should cover as many elements as possible, as it allows quantization with higher resolution.

First, we propose Algorithm 1 to relax two scale factors Δ1 and Δ2, one of which will be modified to satisfy (4), based on the rounding direction of L, which is the ratio of the two scale factors in the logarithmic domain. This ensures that the larger scale factor is not reduced and avoids clipping of the original data.

Algorithm 1 Relax Two Scale Factors.
Require: Δ1 > 0 and Δ2 > 0
Ensure: Δ1/Δ2 = 2^k, k = ···, −2, −1, 0, 1, 2, ···
1: function Relax(Δ1, Δ2)
2:   L ← log2(Δ2/Δ1)
3:   if ⌊L⌉ > L then Δ1, Δ2 ← Δ1, 2^⌊L⌉ · Δ1   ⊲ make Δ2 larger
4:   else Δ1, Δ2 ← 2^(−⌊L⌉) · Δ2, Δ2   ⊲ make Δ1 larger
5:   return Δ1, Δ2
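Algorithm 1 is small enough to transcribe directly; a Python version is given below for illustration (note that Python's round() breaks ties to even, whereas ⌊·⌉ denotes nearest rounding; ties are rare in practice).

import math

def relax(d1, d2):
    # Algorithm 1: enlarge one of the two scale factors so that d1/d2 becomes
    # an exact power of two; the larger factor is never reduced, so no extra
    # clipping is introduced.
    L = math.log2(d2 / d1)
    k = round(L)                      # nearest rounding of L
    if k > L:
        return d1, (2.0 ** k) * d1    # make d2 larger
    return (2.0 ** -k) * d2, d2       # make d1 larger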
Based on Algorithm 1, the quantization parameters are determined by Algorithm 2. In this step, the coarse scale factors are obtained by using the maximum and minimum values of the calibration data as the boundaries for coarse uniform quantization. The boundaries for the fine subranges are set as the q-th quantile points, where the initial q is a hyperparameter. We apply three rounds of Algorithm 1 to ensure that these four scale factors satisfy (4), as shown in Lines 4 to 8.

Algorithm 2 Progressive Relaxation Algorithm.
Input: Tensor x, Quantization Bit-Width b, Acceptable Ratio λ_A of Δ_C/Δ_F, Initial Quantile q, Acceptable Quantile q_A
Output: Quantization Parameters Δ_F−, Δ_F+, Δ_C−, Δ_C+
1:  function PRA(x, b, λ_A, q, q_A)
2:    ⊲ determine parameters for Mode A ⊳
3:    x−, x+ ← −x[x < 0], x[x > 0]
4:    Δ_C−, Δ_C+ ← Relax(Max(x−)/2^(b−2), Max(x+)/(2^(b−2) − 1))   ⊲ relaxation round 1
5:    Δ_F−, Δ_F+ ← Relax(Quantile(x−, q)/2^(b−2), Quantile(x+, q)/(2^(b−2) − 1))   ⊲ relaxation round 2
6:    s_F, s_C ← Δ_F−/Δ_F+, Δ_C−/Δ_C+   ⊲ record the ratios before relaxing F and C
7:    Δ_F+, Δ_C+ ← Relax(Δ_F+, Δ_C+)   ⊲ relaxation round 3
8:    Δ_F−, Δ_C− ← s_F · Δ_F+, s_C · Δ_C+   ⊲ Mode A
9:    ⊲ further relax or switch the mode ⊳
10:   if Δ_C−/Δ_F− < λ_A and Δ_C+/Δ_F+ < λ_A and q > q_A then
11:     return PRA(x, b, λ_A, q − 0.01, q_A)   ⊲ recursively relax
12:   if Δ_C−/Δ_F− < λ_A and Δ_C− ≤ Δ_F+ then
13:     Δ_F−, Δ_C−, Δ_C+ ← Δ_C−, ∅, Δ_C+/2   ⊲ Mode C
14:   if Δ_C+/Δ_F+ < λ_A and Δ_C+ ≤ Δ_F− then
15:     Δ_F+, Δ_C−, Δ_C+ ← Δ_C+, Δ_C−/2, ∅   ⊲ Mode C
16:   if Δ_C−/Δ_F− < λ_A or Δ_C+/Δ_F+ < λ_A then
17:     Δ_F−, Δ_F+, Δ_C−, Δ_C+ ← Δ_C−/2, ∅, ∅, Δ_C+/2   ⊲ Mode D
18:   return Δ_F−, Δ_F+, Δ_C−, Δ_C+

Once the four scale factors are obtained under the assumption of Mode A, further relaxing or mode switching is performed by using four branches, as shown in Lines 10 to 17.

In the first branch, the ratios between the coarse and fine scale factors for both the positive and negative parts fall below λ_A, indicating unacceptable wastage of encoding space. In this case, the current coarse-fine range partition scheme is considered unsuitable under the quantile point q; thus, Algorithm 2 is restarted with a smaller q, i.e., relaxing Principle ② to satisfy Principle ①. As the endpoint of the recursion, q_A also limits the minimum proportion of the data covered by fine subranges.

In the second and third branches, either the positive or negative part exhibits an unsuitable coarse-fine range partition scheme and has a sufficiently small boundary. In this case, uniform quantization is performed for the corresponding part with the initial coarse scale factor; the encoding space of the subrange is merged into that of the coarse subrange on the other side of zero, enhancing its quantization resolution. Note that, in the pseudo-code, setting the scale factor of a subrange to ∅ indicates that the subrange is merged. These two cases are mapped to Mode C, where the data on either side of zero lacks significant long-tailed distribution characteristics.
In the last branch, as a fallback, uniform quantization is applied to the positive and negative parts, respectively, aligning with Mode D.

Additionally, for a non-positive or non-negative tensor x̃, it is first concatenated with −x̃ to form a new tensor. Subsequently, the progressive relaxation algorithm is applied to obtain the quantization parameters. Finally, the two scale factors corresponding to the parts of −x̃ are set to ∅. This implements Mode B.
to ∅. This implements Mode B. 
Figure 3 shows the 4-bit quantization points generated for data in where the sign extension and the shift count 𝑛𝑠ℎ are determined based
different modules of ViT using the proposed algorithm. It can be seen on the subrange that the QUB falls into. After decoding, the output 𝑑
that the obtained quantization points adequately match the correspond- can be represented by an 8-bit signed number 𝐷 7−0 and a 3-bit 𝑛𝑠ℎ as
ing data distributions. Additionally, we evaluate the quantization errors 𝑑 = {𝑠𝑖𝑔𝑛, 𝐸 6−0 } ≪ 𝑛𝑠ℎ = 𝐷 7−0 ≪ 𝑛𝑠ℎ . (7)
of uniform quantization (BaseQ) and QUQ for the four types of data
in Figure 3, based on the mean squared error (MSE). The results for lt is noteworthy that the decoding result of an 8-bit unsigned QUB
various quantization bit-widths are presented in Table 1. It shows that corresponding to Mode B is also expressed as an 8-bit signed number
QUQ introduces smaller quantization errors than conventional uniform (with sign extension). This means that an 8-bit signed multiplier can
quantization. accommodate QUB in any mode. In contrast, unsigned 8-bit integers
cannot be processed by an 8-bit signed multiplier. Therefore, in uniform
Table 1: MSEs of Different Quantization Methods. quantization, a signed multiplier of a higher bit-width is necessary to
support unsigned integers.
Method Bit Query W Post-Softmax A Pre-Addition A Post-GELU A
BaseQ 4 2.14 × 10 −4 3.55 × 10 −5 3.19 × 10 −1 9.40 × 10 −3 4.2 Accelerator Design
QUQ 4 8.71 × 10 −5 2.34 × 10 −5 8.53 × 10 −2 1.78 × 10 −3 We design a quadruplet uniform accelerator (QUA), as shown in Figure 6.
BaseQ 6 1.93 × 10 −5 8.80 × 10 −6 4.58 × 10 −2 2.97 × 10 −3 As per the representation methods and arithmetic rules of QUB, the
QUQ 6 5.59 × 10 −6 9.39 × 10 −7 5.31 × 10 −3 1.01 × 10 −4 accelerator for QUQ can be supported by adding additional circuits to
BaseQ 8 1.23 × 10 −6 8.13 × 10 −7 4.11 × 10 −3 1.83 × 10 −4 existing designs, shown as the red components in Figure 6.
QUQ 8 3.41 × 10 −7 6.13 × 10 −8 3.29 × 10 −4 6.00 × 10 −6 Decoding unit (DU). DU is devised based on (6) and (7). It takes
a QUB as the input and converts it into a signed integer 𝐷 and an
unsigned integer 𝑛𝑠ℎ .
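A behavioural sketch of (6) and (7): given a QUB and the fields of the two FC Registers, recover the signed 8-bit value D and the shift count n_sh. The dictionary field names ('both', 'neg', 'nsh_neg', 'nsh_pos' for bits 7, 6, 5-3 and 2-0 of Reg F / Reg C) are illustrative; this is a software model, not the RTL of the decoding unit.

def decode_qub(e, f_reg, c_reg):
    # e: 8-bit QUB code E7-0. Each register is a dict with fields
    #   'both'   : the subrange pair keeps both signs (bit 7),
    #   'neg'    : if merged, the reserved part is negative (bit 6),
    #   'nsh_neg': shift count for the negative subrange (bits 5-3),
    #   'nsh_pos': shift count for the positive subrange (bits 2-0).
    e7, e6, low7 = (e >> 7) & 1, (e >> 6) & 1, e & 0x7F
    reg = f_reg if e7 else c_reg              # E7 selects fine (1) or coarse (0)
    sign = e6 if reg['both'] else reg['neg']  # sign from the QUB or the register
    d8 = (sign << 7) | low7                   # {sign, E6-0}, as in Eq. (7)
    d = d8 - 256 if sign else d8              # interpret as 8-bit two's complement
    nsh = reg['nsh_neg'] if sign else reg['nsh_pos']
    return d, nsh                             # decoded value: (d << nsh) * base Δ

# Example: a fine-subrange QUB with sign bit 0 and magnitude code 5.
print(decode_qub(0b1_0_000101,
                 {'both': 1, 'neg': 0, 'nsh_neg': 1, 'nsh_pos': 0},
                 {'both': 1, 'neg': 0, 'nsh_neg': 2, 'nsh_pos': 3}))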
4.2 Accelerator Design

We design a quadruplet uniform accelerator (QUA), as shown in Figure 6. As per the representation methods and arithmetic rules of QUB, the accelerator for QUQ can be supported by adding additional circuits to existing designs, shown as the red components in Figure 6.

Figure 6: (Left) Quadruplet Uniform Accelerator. (Middle) Processing Element. (Right) Decoding Unit and Quantization Unit.

Decoding unit (DU). The DU is devised based on (6) and (7). It takes a QUB as the input and converts it into a signed integer D and an unsigned integer n_sh.

Processing element (PE) array. Each PE performs a multiply-accumulate operation and stores the intermediate results. Once the final result is generated, it is output to the quantization unit (QU) in turn. To support QUQ, the main change is that the inputs of the multiply-accumulate operation are no longer individual integers but
rather D and n_sh generated by the DUs. As per (5), the hardware overhead of the arithmetic circuit is limited to a low-bit-width addition and a shift operation.

Quantization unit (QU). A traditional QU for uniform quantization scales the accumulated result followed by clipping and rounding operations [9]. Comparing (2) and (5), an additional right-shift operation is necessary to perform the scaling by s_yk. Since the scale factor for y_k is dynamically determined based on the dot product result and the FC Registers, it is necessary to compare it with the boundaries of the four subranges. These operations can be optimized by detecting the number of leading zeros or ones, as the boundary for comparison is either −2^b or 2^b − 1, where b ∈ N.

Special function unit (SFU). The inference process of a ViT requires more than just GEMM. Special functions are also needed to implement LayerNorm, element-wise addition, Softmax, and GELU. We introduce a QUB decoder and a shifter in the data loading path of the SFUs to convert the encoded QUB into an integer d = D ≪ n_sh. Consequently, we can streamline the SFUs to perform the same functions as the accelerator designed for uniform quantization in [5, 6].

It can be seen that we only insert some conversion circuits without altering the original data flow. QUA can be considered more as an integration method than an architecture itself. Therefore, existing techniques for NN accelerators can also be employed to enhance hardware efficiency.

5 RELATED WORKS

PTQ4ViT [12] and APQ-ViT [2] focus on partially quantized ViT models, i.e., they do not consider the input activations of some operations such as residual connection and LayerNorm. Notably, PTQ4ViT uses twin uniform quantization for specific activation values, which can be considered as a subset of QUQ.

BiScaled-FxP [4] records an index table to identify outliers for each tensor and applies an additional scale factor to quantize them. While BiScaled-FxP is effective in handling non-negative activations and symmetrically distributed weights in CNNs, it is unsatisfactory when dealing with data exhibiting diverse distribution patterns in ViT. Furthermore, the index table introduces unpredictable overhead when there are numerous outliers to be indexed.

FQ-ViT [7] enables a full quantization for ViT. However, it employs row-wise quantization for weights and specific activations, leading to distinct quantization parameters for different row vectors within a matrix. The row-wise scheme incurs additional memory overhead and complexity to the computation and quantization, and may not be supported by existing architectures [9].

I-BERT [5] and I-ViT [6] investigate integer-only inference for transformers. It is noteworthy that, although they avoid floating-point operations at all stages in Figure 1, 32-bit integer activations are still required to maintain accuracy. Consequently, there is no actual reduction in memory overhead.

6 EXPERIMENTS

6.1 Accuracy Evaluation

To evaluate the performance of the proposed QUQ, PTQ experiments are conducted for image classification on ImageNet [1]. Three models are considered, including ViT [3], DeiT [10], and Swin [8].

Experimental details. We randomly select 32 images from the training dataset of ImageNet as the input for calibration to obtain the quantization parameters. After obtaining the four scale factors following the steps of our proposed algorithm, we employ a grid search similar to [12] to conduct a layer-wise Hessian-based optimization. For the hyperparameters in Algorithm 2, in all experiments, the acceptable ratio λ_A, initial quantile q, and acceptable quantile q_A are set to 4, 0.99, and 0.95, respectively.

For partial quantization, our method is compared with PTQ4ViT [12] and APQ-ViT [2] under 6-bit quantization. For a fair comparison, only operations that can be implemented by GEMM are quantized to 6-bit, while the remaining parts are retained in floating-point format. Additionally, to further evaluate the effectiveness of QUQ, we substitute QUQ with uniform quantization while maintaining the rest of the PTQ process unchanged, denoted as BaseQ. Table 2 shows the Top-1 accuracy of different quantization methods across various models.

Table 2: Accuracy Comparison of Partially Quantized ViTs.

Method    W/A    ViT-S  ViT-L  DeiT-S  DeiT-B  Swin-T  Swin-S
Original  32/32  81.39  85.84  79.80   81.80   81.39   83.23
BaseQ     6/6    69.73  80.96  72.55   78.94   78.44   82.04
PTQ4ViT*  6/6    78.63  85.05  76.28   80.25   80.47   82.38
APQ-ViT†  6/6    79.10  -      77.76   80.42   -       82.76
QUQ       6/6    79.65  85.57  78.73   81.60   80.95   83.06
* Some activations are not uniformly quantized.
† Block-wise Hessian information is considered to optimize the parameters instead of the layer-wise counterpart in others.

It can be seen that QUQ surpasses state-of-the-art quantization methods for partial quantization of ViTs. Compared to the full-precision models, 6-bit QUQ results in less than a 0.3% accuracy drop for ViT-L, DeiT-B, and Swin-S, yet a larger drop for models with the same architecture of a smaller scale (ViT-S, DeiT-S, and Swin-T).

For full quantization, our method is compared with FQ-ViT [7] and BiScaled-FxP [4] under 6-bit and 8-bit quantization, as shown in Table 3. Since BiScaled-FxP conducts experiments only on CNNs, the relevant experimental results are reproduced based on the method described in [4]. Note that the optimization techniques used in QUQ are also applied to BiScaled-FxP.

Table 3 shows that QUQ exhibits a more pronounced advantage than state-of-the-art works for fully quantized models. Although a more significant accuracy drop can be observed in 6-bit quantization, QUQ is, to the best of our knowledge, the first method that can produce usable results in this case.
Table 3: Accuracy Comparison of Fully Quantized ViTs.

Method        W/A    ViT-S  ViT-L  DeiT-S  DeiT-B  Swin-T  Swin-S
Original      32/32  81.39  85.84  79.80   81.80   81.39   83.23
BaseQ         6/6    0.10   0.10   0.09    0.17    0.10    0.41
BiScaled-FxP  6/6    0.30   5.94   0.64    7.89    0.16    0.39
FQ-ViT†       6/6    9.92   6.86   60.14   68.84   36.25   70.17
QUQ           6/6    69.43  78.51  69.96   76.17   76.05   79.14
BaseQ         8/8    1.00   2.44   66.26   31.37   55.90   40.06
BiScaled-FxP  8/8    72.37  70.70  78.40   73.73   80.04   77.41
FQ-ViT†       8/8    79.49  85.03  79.17   81.20   80.51   82.71
QUQ           8/8    80.49  85.73  79.40   81.40   81.24   83.25
† Weights and certain activations are quantized row-wise.

To evaluate the impact of QUQ on the attention mechanism in fully quantized ViT, we select some images from the validation dataset of ImageNet and visualize the attention maps, as shown in Figure 7. For the 8-bit case, the attention of uniform quantization in crucial regions begins to decrease, while the attention of QUQ remains relatively constant compared to the original. For the 6-bit case, the attention of uniform quantization is no longer activated, while QUQ still effectively maintains attention in crucial regions.

Figure 7: Attention map visualization for ViT-S (original, 8-bit BaseQ, 8-bit QUQ, 6-bit BaseQ, and 6-bit QUQ).

6.2 Hardware Evaluation

We evaluate the hardware overhead of QUQ by comparing the proposed QUA with the one for uniform quantization. The evaluation model is implemented based on the architecture depicted in Figure 6. Since QUQ can utilize the same SFUs and scratchpad memory as uniform quantization, they are not taken into consideration. All designs are synthesized under a consistent constraint, leveraging Synopsys Design Compiler on a 28 nm CMOS technology, with power reported through PrimeTime PX under a 500 MHz clock. The evaluation results are shown in Table 4.

Table 4: Area and Power of Various NN Accelerators.

                16 × 16 PE Array          64 × 64 PE Array
Method  W/A     Area (mm²)  Power (mW)    Area (mm²)  Power (mW)
BaseQ   6/6     0.148       52.4          2.205       701.3
QUQ     6/6     0.153       57.2          2.247       767.5
BaseQ   8/8     0.175       60.6          2.702       796.7
QUQ     8/8     0.182       65.1          2.714       851.6

It shows that QUQ incurs marginal hardware overhead compared to uniform quantization, exhibiting less than 5% and 10% overheads in area and power, respectively, for the considered cases. Increasing the size of the PE array reduces the relative area overhead of the accelerator. This phenomenon can be attributed to the decreasing proportion of additional circuits required to support the DUs and QUs, compared to the quadratic growth of PEs. Additionally, the increase in power mainly stems from the additional registers required to pipeline n_sh, which further increases the clock load.

It is noteworthy that 6-bit QUQ achieves significantly higher accuracy than 8-bit BaseQ across all models (see Table 3), yet results in 12.6%-16.8% and 3.7%-5.6% reductions in area and power consumption, respectively. Moreover, reducing the quantization bit-width further decreases the memory overhead.

7 CONCLUSION

To support efficient quantization for data with various distribution characteristics, we propose quadruplet uniform quantization (QUQ). A progressive relaxation algorithm is devised for QUQ to select suitable quantization parameters. Furthermore, we encode the quantization results as quadruplet uniform bytes (QUBs) and design a quadruplet uniform accelerator (QUA). The experimental results show that QUQ results in higher accuracy than state-of-the-art PTQ methods for ViT, especially for fully quantized models. While achieving higher accuracy, QUQ requires lower area, power, and memory than conventional uniform quantization.

It is noteworthy that, besides ViTs, QUQ is inherently capable of effectively quantizing other NN models. Given its compatibility with uniform quantization and ease of deployment, QUQ can serve as a versatile extension for any NN accelerator, which offers an additional option for software and hardware co-optimization.

8 ACKNOWLEDGEMENTS

This work was supported in part by the National Key Research and Development Program of China under grant 2022YFB4500200 and in part by the National Natural Science Foundation of China under grant number 62374108.

REFERENCES

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248-255.
[2] Yifu Ding, Haotong Qin, Qinghua Yan, Zhenhua Chai, Junjie Liu, Xiaolin Wei, and Xianglong Liu. 2022. Towards Accurate Post-Training Quantization for Vision Transformer. In Proceedings of the 30th ACM International Conference on Multimedia. 5380-5388.
[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[4] Shubham Jain, Swagath Venkataramani, Vijayalakshmi Srinivasan, Jungwook Choi, Kailash Gopalakrishnan, and Leland Chang. 2019. BiScaled-DNN: Quantizing long-tailed datastructures with two scale factors for deep neural networks. In Proceedings of the 56th Annual Design Automation Conference 2019. 1-6.
[5] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. I-BERT: Integer-only BERT quantization. In International Conference on Machine Learning. PMLR, 5506-5518.
[6] Zhikai Li and Qingyi Gu. 2023. I-ViT: Integer-only quantization for efficient vision transformer inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17065-17075.
[7] Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. 2021. FQ-ViT: Post-training quantization for fully quantized vision transformer. arXiv preprint arXiv:2111.13824 (2021).
[8] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012-10022.
[9] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. 2021. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295 (2021).
[10] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347-10357.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[12] Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. 2022. PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization. In European Conference on Computer Vision. Springer, 191-207.