Gpumode Talk 20241026
leiwang1999@outlook.com
Oct 26, 2024
Outline
Conventional Quantization:
LLAMA-2-7B with FP16 precision requires at least 14 GB of memory to host the model (see the arithmetic sketch below).

Model checkpoint sizes: LLAMA-7B 13 GB, LLAMA-13B 37 GB, LLAMA-30B 76 GB, LLAMA-65B 122 GB.

Recent research has pushed the boundaries of low-bit quantization:
- 8 bits: SmoothQuant
- 4 bits: AutoGPTQ, BitDistiller*
- 2 bits: BitNet-1.58bits*
- 1 bit: BitNet*, OneBit
*represents research from MSRA
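A quick back-of-the-envelope check of the 14 GB figure above: weight memory is roughly parameter count times bits per parameter. A minimal Python sketch (the 7B parameter count is the nominal model size, used here only for illustration):

# Rough weight-memory footprint: params * bits_per_param / 8 bytes.
def weight_footprint_gb(num_params: float, bits: int) -> float:
    return num_params * bits / 8 / 1e9

params_7b = 7e9  # approximate LLAMA-2-7B parameter count
for bits in (16, 8, 4, 2, 1):
    print(f"{bits:>2}-bit: {weight_footprint_gb(params_7b, bits):.1f} GB")
# 16-bit gives ~14 GB, matching the FP16 figure above;
# 4-bit drops this to ~3.5 GB, and 1-bit to ~0.9 GB.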
Challenges
Three Major Challenges
Unsupported numerical precision in software
New data types such as NF4/AF4/MXFP have emerged.

Key Observation 1: The memory system has compatibility.
An opaque fixed-width data block (e.g., one int8 of 8-bit storage) can be reinterpreted into arbitrary data types (e.g., 2x nf4).
The memory system can store any data type by converting these custom data types into fixed-width opaque data blocks.
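A minimal NumPy sketch of this observation: two 4-bit codes (e.g., nf4 indices) are packed into one opaque uint8 block for storage and reinterpreted again on load. The packing order (low nibble first) is an illustrative assumption, not Ladder's actual storage format.

import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (values 0..15) into opaque uint8 blocks."""
    codes = codes.astype(np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << 4)).astype(np.uint8)

def unpack_nibbles(blocks: np.ndarray) -> np.ndarray:
    """Reinterpret the uint8 blocks back into the original 4-bit codes."""
    low = blocks & 0x0F
    high = blocks >> 4
    return np.stack([low, high], axis=1).reshape(-1)

codes = np.array([3, 15, 0, 9], dtype=np.uint8)   # four 4-bit codes
blocks = pack_nibbles(codes)                      # two opaque uint8 blocks
assert np.array_equal(unpack_nibbles(blocks), codes)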
Key Observation 2: The compute instruction has compatibility.
INT4, for example, is losslessly compatible with FP16/FP32; we can leverage this compatibility when choosing which compute instruction to use.
Most custom data types can be losslessly converted into wider standard data types supported by existing hardware computing units for processing.
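A small NumPy sketch of this observation: every signed int4 value (-8..7) converts to float16 without loss, so the multiply-accumulate can run on standard FP16 units. This is an illustrative check, not the generated kernel code.

import numpy as np

int4_values = np.arange(-8, 8)                 # every representable int4 value
as_fp16 = int4_values.astype(np.float16)       # widen to a hardware-supported type
assert np.array_equal(as_fp16.astype(np.int64), int4_values)  # round-trip is exact

# The matmul itself can then use FP16 compute units:
w_int4 = np.array([[-8, 7], [3, -2]])
a_fp16 = np.array([[1.0, 2.0]], dtype=np.float16)
out = a_fp16 @ w_int4.astype(np.float16)       # INT4 weights processed as FP16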
Separate Datatype and Computing with Machine Learning Compilation

Conventional MLC separates Compute from Schedule:
- A Python-like DSL
- Intermediate representation (TensorIR, MLIR, ...)
- Transformations (loop unrolling, ...)
- Backend: generate code or executables for different hardware

Like ML compilation, can we separate Datatype from Computing?
- Activation data types (FP16, FP32, INT8, ...) and weight data types (FP8, INT4, NF4, ...) across hardware backends (Ampere, Volta, RDNA, CDNA, ...)
- We need a universal type representation to hide the conversion and do efficient codegen.

However, the performance of current machine learning compilation is still unsatisfactory, even with hardware-supported instructions.
Existing compilation systems fail to fully utilize the performance of computing units.
[Figure: MatMul performance of MLC on RTX 3090 (Tensor Core)]
Simple memory accesses struggle to meet the demands of the various storage levels simultaneously (e.g., GMEM expects coalesced access).
Insight: the abstraction needs to be aware of, and able to manipulate, the data layout of tensors!
Tensor-Centric System Abstractions
Four tTile schedule primitives:
- tTile slice(tTile, index, shape, output_shape);
- tTile Pad(tTile, pad_shape, pad_value);
- tTile Convert(tTile, scope, c_func);  e.g., converting an INT4Bit tile into FLOAT16
- tTile TransformLayout(tTile, scope, index_map);
These abstractions enlarge the scheduling space for DNN computation! (Ladder, OSDI ’24)
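The primitives can be read as ordinary tensor-rewriting operations. Below is a hedged, NumPy-only sketch of how they might behave, written from the signatures on the slide; the argument interpretations are assumptions for illustration, not Ladder's actual tTile implementation (the scope argument, which names a memory level, is only carried along here).

import numpy as np

class tTile:
    """Illustrative stand-in for a tTile: a tensor plus the memory scope it lives in."""
    def __init__(self, data, scope="global"):
        self.data = np.asarray(data)
        self.scope = scope

def Slice(t, index, shape, output_shape):
    # Cut the output_shape-sized sub-tile starting at tile coordinate `index`
    # of a grid of `shape`-sized tiles (argument interpretation is assumed).
    start = [i * s for i, s in zip(index, shape)]
    region = tuple(slice(b, b + n) for b, n in zip(start, output_shape))
    return tTile(t.data[region], t.scope)

def Pad(t, pad_shape, pad_value):
    # Pad each dimension up by pad_shape elements with pad_value.
    return tTile(np.pad(t.data, [(0, p) for p in pad_shape], constant_values=pad_value), t.scope)

def Convert(t, scope, c_func):
    # Convert element type while (conceptually) moving the tile to another scope.
    return tTile(c_func(t.data), scope)

def TransformLayout(t, scope, index_map):
    # Physically rearrange elements according to index_map(i, j, ...) -> new index.
    out = np.empty_like(t.data)
    for idx in np.ndindex(*t.data.shape):
        out[index_map(*idx)] = t.data[idx]
    return tTile(out, scope)

# Example: take an int8-stored tile and convert it for FP16 compute units.
tile = tTile(np.arange(64, dtype=np.int8).reshape(8, 8))
fp16_tile = Convert(tile, "shared", lambda x: x.astype(np.float16))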
Auto Normalize Computation into Hardware Instructions

Bit-nearest instruction matching: matches the instruction type to convert into based on the instruction's computation pattern and throughput. For example, an FP32 FMA computation over INT4 inputs can be converted into FP16 HFMA2 or FP16 HMMA instructions.
Iterator-based automatic expression normalization.

Device | Inst | Data Type | TFLOPS/TOPS | Expression
RTX 3090 | DFMA | FLOAT64 | 8.9 TFLOPS | D[0] = A[0] * B[0] + C[0]
RTX 3090 | FMA | FLOAT32 | 35.6 TFLOPS | D[0] = A[0] * B[0] + C[0]
RTX 3090 | IMAD | INT32 | 17.8 TOPS | D[0] = A[0] * B[0] + C[0]
RTX 3090 | HFMA2 | FLOAT16 | 35.6 TFLOPS | D[0:2] = A[0:2] * B[0:2] + C[0:2]
RTX 3090 | DP4A | INT8 | 71.2 TOPS | D[0] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | HMMA.m16n8k16.f16 | FLOAT16 | 142 TFLOPS | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | IMMA.m16n8k32.s8 | INT8 | 284 TOPS | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]
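A hedged sketch of what "bit-nearest" matching could look like: among instructions whose operand type can hold the custom type losslessly, pick the one with the highest throughput. The instruction table is taken from the slide; the selection heuristic itself is an assumption based on the description above, not Ladder's exact algorithm.

# Candidate instructions from the table above: (inst, operand dtype, operand bits, throughput).
INSTRUCTIONS = [
    ("DFMA", "float64", 64, 8.9),
    ("FMA", "float32", 32, 35.6),
    ("IMAD", "int32", 32, 17.8),
    ("HFMA2", "float16", 16, 35.6),
    ("DP4A", "int8", 8, 71.2),
    ("HMMA.m16n8k16.f16", "float16", 16, 142.0),
    ("IMMA.m16n8k32.s8", "int8", 8, 284.0),
]

def bit_nearest_match(custom_bits: int, custom_is_float: bool):
    """Pick the highest-throughput instruction whose operand type can hold the custom type."""
    def compatible(dtype: str, bits: int) -> bool:
        if custom_is_float:
            # A float custom type needs a float operand at least as wide.
            return dtype.startswith("float") and bits >= custom_bits
        # A narrow integer fits losslessly into wider integers, and into the
        # float types listed here (their mantissas are wide enough).
        return bits >= custom_bits
    candidates = [inst for inst in INSTRUCTIONS if compatible(inst[1], inst[2])]
    return max(candidates, key=lambda inst: inst[3])

print(bit_nearest_match(4, custom_is_float=False))   # INT4 -> IMMA.m16n8k32.s8 (284 TOPS)
print(bit_nearest_match(4, custom_is_float=True))    # FP4-like -> HMMA.m16n8k16.f16 (142 TFLOPS)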
The search space is vast, with possible combinations on the order of O(N!); it is impossible to traverse all of them.

Optimal Layout Deduction
- tDevice: a hardware abstraction that explicitly defines the preferred access pattern for different memory layers.
- Hardware-aligned: deduce the perfect access pattern, e.g. for ldmatrix, (a) the 8 fp16 shared-memory values accessed per thread and (b) the 8 values each thread receives in its registers.
[Figure: MMA tile / warp tile decomposition and per-thread element mapping (T0, T1, ...)]

The deduced layout should be able to propagate across different compute blocks!
Methodology: Three different layout propagation modes

Case 1: Linear transformation (transpose as an example; see the check after Case 3)
  layout:     lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
  propagated: lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)

Case 2: Compressed transformation (dequantize as an example; qweight holds INT4 values)
  layout:     lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
  propagated: lambda i, j: (i // 8 * 8 + j * 4 + i % 8 // 2, i % 2)
Case 3: Non-injective transformation (dequantize as an example)
  layout: lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
  Cannot propagate.
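For the linear (transpose) case, the propagated layout is simply the original layout with its logical indices swapped. A small Python sketch checks this for a 16x16 tile using the two index maps from the slide:

# Original layout of the buffer and the layout propagated through a transpose.
layout = lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
propagated = lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)

# Transposing the logical indices and applying the original layout lands on the
# same physical location as applying the propagated layout directly.
for i in range(16):
    for j in range(16):
        assert propagated(i, j) == layout(j, i)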
Resolve Conflict: Layout Auto Differentiation
- Layout conflicts between correlative buffers are resolved with common subexpression elimination (CSE).
- When the system's storage is sufficient, the latency overhead of performing type conversion at each stage is additionally searched, and the configuration with the shortest latency is selected.
System Overview of Ladder/BitBLAS
- Ladder: a system for end-to-end optimizations.
- BitBLAS: a runtime kernel library.

Example usage:
import bitblas

matmul_config = bitblas.MatmulConfig(
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
    …
)
matmul = bitblas.Matmul(config=matmul_config)
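A hedged sketch of how the resulting operator is then used, following the BitBLAS README; the helper name transform_weight and the fact that the config also carries the matmul shapes (M, N, K) are assumptions that may differ across BitBLAS versions:

import torch

# Assumes matmul_config above also specified the problem shape, e.g. M=1, N=1024, K=1024.
activation = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8).cuda()   # int4-range values

weight_int4 = matmul.transform_weight(weight)    # repack into BitBLAS's INT4 storage format
output = matmul(activation, weight_int4)         # FP16 activation x INT4 weight -> FP16 output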
Compilation pipeline: Operator → Auto Tensorization → Layout Propagation → Hardware-aware Tuning → CUDA Source → Executable, backed by a kernel database.
Dequantization detail: one 32-bit word packs 8x int4b; logic instructions convert it into 8x int32b or 8x int8b.
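A minimal sketch of that dequantization step in plain Python: eight signed int4 values packed into one 32-bit word are expanded with shifts and masks, mirroring what the generated logic instructions do. The packing order (low nibble first) is an assumption for illustration.

def unpack_int4x8(word: int):
    """Expand one 32-bit word holding 8 signed int4 values (low nibble first)."""
    values = []
    for k in range(8):
        nibble = (word >> (4 * k)) & 0xF                        # logic ops: shift + mask
        values.append(nibble - 16 if nibble >= 8 else nibble)   # sign-extend int4
    return values

packed = 0x000000F7                       # nibbles: 7, then -1 (0xF), then six zeros
assert unpack_int4x8(packed) == [7, -1, 0, 0, 0, 0, 0, 0]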
Operator Performance of BitBLAS
System Performance Scaling Up
Summary
Challenges From The Community
Tutorials
More info, reproduce, reach:
https://github.com/microsoft/BitBLAS
Thanks for watching
More detail, download: the Ladder paper (OSDI ’24)