
GPU MODE Community

BitBLAS: Enabling Efficient Low-Precision Deep Learning Computing

Lei Wang (/leɪ wɑːŋ/)

leiwang1999@outlook.com
Oct 26, 2024
Outline

Background: Mixed-Precision Computing

Introduction: Design of BitBLAS/Ladder

Experiments (End2End/OP): NVIDIA/AMD

Tutorials in Jupyter: BitBLAS / Ladder / Tile Language


Larger Scale, Fewer Bits

Models keep growing: LLAMA-65B, LLAMA-2-70B, LLAMA-3-400B, Stable Diffusion.
Precisions keep shrinking: FP32, FP16, FP8, MXFP, INT4, FP4, INT1, ...

Conventional quantization: recent research has pushed the boundaries of low-bit! LLAMA-2-7B with FP16 precision requires at least 14 GB of memory just to host the model.

FP16 checkpoint sizes:
LLAMA-7B: 13 GB
LLAMA-13B: 37 GB
LLAMA-30B: 76 GB
LLAMA-65B: 122 GB

Representative methods, from 8 bits down to 1 bit: SmoothQuant (8-bit), AutoGPTQ (4-bit), BitDistiller*, BitNet-1.58bits*, BitNet* and OneBit (1-bit).

*represents research from MSRA
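
As a quick sanity check on these numbers, a back-of-the-envelope Python sketch of my own (counting weight storage only, in GiB):

def weight_gib(n_params, bits):
    # parameters x bits / 8 = bytes, then convert to GiB
    return n_params * bits / 8 / 1024**3

print(f"LLAMA-2-7B @ FP16: {weight_gib(7e9, 16):.1f} GiB")   # ~13 GiB, i.e. ~14 GB to host
print(f"LLAMA-2-7B @ INT4: {weight_gib(7e9, 4):.1f} GiB")    # ~3.3 GiB
print(f"LLAMA-65B  @ FP16: {weight_gib(65e9, 16):.1f} GiB")  # ~121 GiB, matching the 122 GB above
print(f"LLAMA-65B  @ INT1: {weight_gib(65e9, 1):.1f} GiB")   # ~7.6 GiB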
Challenges

Three major challenges:

Unsupported numerical precision in software
New data types such as NF4/AF4/MXFP have emerged.

Unsupported compute instructions in hardware
Most hardware doesn't have an FP16xINT4 compute unit.

Combinatorial explosion that is hard to optimize
Even though vendors and developers have given it attention.
Support from vendor libraries and MLC.

Hardware evolution of lower-precision computing.


Insights

Mixed-Precision GEMM Execution Flow

Key Observation 1: the memory system has compatibility. How?
Example: C[M, N]@FP16 = A[M, K]@FP16 x B[N, K]@FP8, with M = 2, N = 2, K = 4.

An opaque, fixed-width 8-bit storage block can be reinterpreted into arbitrary datatypes: 2 x int4, 2 x nf4, or 1 x int8.

The memory system can store any data type by converting these custom data types into fixed-width opaque data blocks.
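
As a small illustration of this observation (a numpy sketch of my own, not BitBLAS code), the same 8-bit block can be produced from two signed int4 values and read back either as 2 x int4 or as 1 x int8:

import numpy as np

def pack_int4x2(lo, hi):
    # pack two signed int4 values (-8..7) into one opaque uint8 storage block
    return np.uint8(((hi & 0xF) << 4) | (lo & 0xF))

def unpack_int4x2(block):
    # reinterpret the same block as two signed int4 values (sign-extend each nibble)
    to_signed = lambda n: n - 16 if n >= 8 else n
    return to_signed(int(block) & 0xF), to_signed(int(block) >> 4)

block = pack_int4x2(-3, 7)
print(hex(int(block)))                     # 0x7d: the storage itself is just 8 opaque bits
print(unpack_int4x2(block))                # (-3, 7)  when read as 2 x int4
print(np.array([block]).view(np.int8)[0])  # 125      when read as 1 x int8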

Key Observation 2: the compute instructions have compatibility. Which ones can we leverage?

INT4 is losslessly compatible with FP16 and FP32, and FP16 is losslessly compatible with FP32.

Most custom data types can be losslessly converted into wider standard data types supported by existing hardware computing units for processing.
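
A quick numpy check of this lossless compatibility (my own illustration): every int4/int8 value survives a round trip through fp16 unchanged, while wider integers would not.

import numpy as np

int4_vals = np.arange(-8, 8)
int8_vals = np.arange(-128, 128)
assert np.all(int4_vals.astype(np.float16).astype(np.int64) == int4_vals)  # int4 -> fp16 is lossless
assert np.all(int8_vals.astype(np.float16).astype(np.int64) == int8_vals)  # int8 -> fp16 is lossless

# fp16 has an 11-bit significand, so integers above 2048 start to round:
print(np.float16(2049))  # 2048.0 -> int32 is NOT losslessly compatible with fp16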
Separate Datatype and Computing with Machine Learning Compilation

Conventional MLC stack:
• A Python-like DSL
• Intermediate representation (TensorIR, MLIR, ...)
• Transformations (loop unroll, ...)
• Backend: generate code or executables for different hardware

Just as ML compilation separates compute from schedule, can we separate the activation/weight datatype (FP16, FP32, FP8, INT8, INT4, NF4, ...) from the hardware backend (Ampere, Volta, RDNA, CDNA, ...)?

We need a universal type representation to hide the conversion and do efficient codegen. However, the performance of current machine learning compilation is still unsatisfactory, even under hardware-supported instructions.
Existing compilation systems fail to fully utilize the performance of computing units.

Figure: MatMul performance of MLC on an RTX 3090 (Tensor Core). AMOS and TensorIR can only reach 60-80% of cuBLAS performance.

Simple memory accesses struggle to meet the demands of the various storage levels simultaneously:
• GMEM: expects coalesced access
• SMEM: expects bank-conflict-free access
• REG: must align with the instruction

Reference: A Swizzling Rule for 8-Bit Tensor Cores (NVIDIA GTC 2020)
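
To make the SMEM point concrete, here is a tiny illustrative sketch of my own (simplified to fp32 elements and a padded row, rather than the fp16 swizzling the GTC talk describes): 32 banks of 4 bytes, one warp reading a column of a row-major tile.

def worst_case_conflict(n_cols):
    # thread t of a warp reads element [t][0] of a row-major float32 tile;
    # shared memory has 32 banks, each 4 bytes wide
    banks = [((t * n_cols) * 4 // 4) % 32 for t in range(32)]
    return max(banks.count(b) for b in set(banks))

print(worst_case_conflict(32))  # 32: every thread hits the same bank (32-way conflict)
print(worst_case_conflict(33))  # 1: padding the row by one element spreads the accesses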
Major Factors for Performance

1) Efficient tiling: control the compute-to-memory ratio, cache usage and size, and register size. Existing MLC primitives can handle this.

2) Bandwidth utilization: a better memory access pattern. Existing MLC primitives cannot handle this, and it is hard to get the swizzling rule right (see Swizzle Inventor, ASPLOS 2021; Graphene, ASPLOS 2023).

Insight: the abstraction needs to be aware of and able to manipulate the data layout of tensors!
Tensor-Centric System Abstractions

Four tTile schedule primitives:
• tTile slice(tTile, index, shape, output_shape)
• tTile Pad(tTile, pad_shape, pad_value)
• tTile Convert(tTile, scope, c_func)
• tTile TransformLayout(tTile, scope, index_map)

Example: using tTile to build a mixed-precision computing expression (INT4 -> FLOAT16), and an example of a scheduled execution plan built with the tTile schedule primitives on NVIDIA GPUs.
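
To give a feel for how these primitives compose, here is a toy, self-contained Python model of the four primitives operating on numpy arrays (illustrative only; the real Ladder abstraction works on compiler IR and generates code, it does not execute eagerly like this):

from dataclasses import dataclass
import numpy as np

@dataclass
class TTile:
    data: np.ndarray
    scope: str = "global"

def t_slice(t, index, shape, output_shape):
    # slice(tTile, index, shape, output_shape): cut out a sub-tile
    sl = tuple(slice(i, i + s) for i, s in zip(index, shape))
    return TTile(t.data[sl].reshape(output_shape), t.scope)

def t_pad(t, pad_shape, pad_value):
    # Pad(tTile, pad_shape, pad_value): pad up to an instruction-aligned shape
    pads = [(0, p - s) for s, p in zip(t.data.shape, pad_shape)]
    return TTile(np.pad(t.data, pads, constant_values=pad_value), t.scope)

def t_convert(t, scope, c_func):
    # Convert(tTile, scope, c_func): element-wise datatype conversion in a given scope
    return TTile(c_func(t.data), scope)

def t_transform_layout(t, scope, index_map):
    # TransformLayout(tTile, scope, index_map): rearrange elements by an index map
    out = np.empty_like(t.data)
    for idx in np.ndindex(*t.data.shape):
        out[index_map(*idx)] = t.data[idx]
    return TTile(out, scope)

# INT4 -> FLOAT16 style flow: pad a quantized weight tile to the MMA shape, move it
# to shared scope (a real schedule would pass a hardware-aligned index map here),
# convert it to fp16, and slice out one 16x16 MMA tile.
w_int4 = TTile(np.random.randint(-8, 8, size=(14, 30)).astype(np.int8))
w_pad  = t_pad(w_int4, pad_shape=(16, 32), pad_value=0)
w_smem = t_transform_layout(w_pad, "shared", lambda i, j: (i, j))
w_fp16 = t_convert(w_smem, "register", lambda x: x.astype(np.float16))
mma_in = t_slice(w_fp16, index=(0, 0), shape=(16, 16), output_shape=(16, 16))
print(mma_in.data.dtype, mma_in.data.shape, mma_in.scope)  # float16 (16, 16) register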
New Design Space

Example of our tTile-Graph abstraction for end-to-end optimization of LLAMA, enabling more fine-grained control across operators and even across different memory layers.

These abstractions enlarge the scheduling space for DNN computation! For more detail, see the OSDI ’24 Ladder paper.
Auto Normalize Computation into Hardware Instructions

Bit-nearest instruction matching: match the instruction to convert to based on the instruction's computation pattern and throughput. For example, an FP32 FMA expression over an INT4 operand can be converted into HFMA2 FP16 or HMMA FP16 instructions.

Device   | Inst              | Data Type | Throughput  | Expression
RTX 3090 | DFMA              | FLOAT64   | 8.9 TFLOPS  | D[0] = A[0] * B[0] + C[0]
RTX 3090 | FMA               | FLOAT32   | 35.6 TFLOPS | D[0] = A[0] * B[0] + C[0]
RTX 3090 | IMAD              | INT32     | 17.8 TOPS   | D[0] = A[0] * B[0] + C[0]
RTX 3090 | HFMA2             | FLOAT16   | 35.6 TFLOPS | D[0:2] = A[0:2] * B[0:2] + C[0:2]
RTX 3090 | DP4A              | INT8      | 71.2 TOPS   | D[0] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | HMMA.m16n8k16.f16 | FLOAT16   | 142 TFLOPS  | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | IMMA.m16n8k32.s8  | INT8      | 284 TOPS    | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]

Iterator-based auto expression normalization: for example, normalizing conv2d into a Tensor Core instruction (Tutorial: Auto Tensorize). This lets us explore whether a given customized op (conv, stencil, ...) can be tensorized by the target instruction.
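
A simplified sketch of my own of the bit-nearest idea (not BitBLAS source, and the lossless-compatibility table here is an assumption): among the instructions above whose compute type can losslessly hold both operand types, pick the one with the highest throughput.

INSTS = [                      # (name, compute dtype, throughput) from the RTX 3090 table above
    ("DFMA", "float64", 8.9), ("FMA", "float32", 35.6), ("IMAD", "int32", 17.8),
    ("HFMA2", "float16", 35.6), ("DP4A", "int8", 71.2),
    ("HMMA.m16n8k16.f16", "float16", 142.0), ("IMMA.m16n8k32.s8", "int8", 284.0),
]

LOSSLESS = {                   # which compute dtypes can losslessly hold a storage dtype
    "int4":    {"int8", "int32", "float16", "float32", "float64"},
    "int8":    {"int8", "int32", "float16", "float32", "float64"},
    "float16": {"float16", "float32", "float64"},
    "nf4":     {"float16", "float32", "float64"},   # table-lookup type, floats only
}

def bit_nearest(a_dtype, w_dtype):
    compatible = LOSSLESS[a_dtype] & LOSSLESS[w_dtype]
    return max((tput, name) for name, ctype, tput in INSTS if ctype in compatible)

print(bit_nearest("float16", "int4"))  # (142.0, 'HMMA.m16n8k16.f16')
print(bit_nearest("int8", "int4"))     # (284.0, 'IMMA.m16n8k32.s8')
print(bit_nearest("float16", "nf4"))   # (142.0, 'HMMA.m16n8k16.f16')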
Hardware Aligned Layout Propagation

Tile-based program memory access: an MMA tile nested inside a warp tile. The search space for data layouts is vast, with possible combinations on the order of O(N!); it is impossible to traverse all of them.

Optimal layout deduction instead relies on tDevice, a hardware abstraction that:
• Explicitly defines the preferred access pattern for the different memory layers.
• Explicitly defines the access pattern of the instructions at the warp level.

From these, the perfect, hardware-aligned access pattern is deduced. Figure: for ldmatrix, (a) the 8 fp16 shared-memory values accessed per thread, and (b) the 8 values each thread (T0-T31) receives in its registers.
Hardware Aligned Layout Propagation

Hardware-aligned layout deduction: define the computation with a DSL (TIR), specify a tDevice, and deduce.

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")
        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk].astype("float16") * B[vj, vk].astype("float16")

The deduction produces a memory-intensive operator that re-lays out the input:

B[vi // 16, vj // 16, vi % 16, vj % 16] = A[vi // 8 * 8 + vi % 4 * 2 + vj % 16 // 8, vj // 16 * 16 + vi % 8 // 4 * 8 + vj % 8]

and a compute-intensive op with perfect layout access:

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer(), B: T.Buffer(), C: T.Buffer()):
        __fetch2shared()
        for ax0, ax1, ax2, ax3 in T.grid(1024, 1024, 16, 16):
            with T.block("A_shared_warp"):
                v0, v1, v2, v3 = T.axis.remap("SSSS", [ax0, ax1, ax2, ax3])
                A_shared_warp[v0, v1, v2 * 2 + v3 // 8, v3 % 8] = A_shared[v0, v1, v2, v3]
        for ax0, ax1, ax2, ax3 in T.grid(1024, 1024, 16, 16):
            with T.block("B_shared_warp"):
                v0, v1, v2, v3 = T.axis.remap("SSSS", [ax0, ax1, ax2, ax3])
                B_shared_warp[v0, v1, v2 * 2 + v3 // 8, v3 % 8] = B_shared[v0, v1, v2, v3]
        for ii, jj, kk, i, j, k in T.grid(1024, 1024, 1024, 16, 16, 16):
            with T.block("B"):
                vii, vjj, vkk, vi, vj, vk = T.axis.remap("SSRSSR", [ii, jj, kk, i, j, k])
                with T.init():
                    C_warp[vii, vjj, vi % 8 * 4 + vj % 8 // 2, vj // 8 * 4 + vi // 8 * 2 + vj % 2] = T.float16(0)
                C_warp[vii, vjj, vi % 8 * 4 + vj % 8 // 2, vj // 8 * 4 + vi // 8 * 2 + vj % 2] += \
                    A_shared_warp[vii, vkk, vi * 2 + vk // 8, vk % 8] * B_shared_warp[vjj, vkk, vj * 2 + vk // 8, vk % 8]
        for ax0, ax1 in T.grid(16384, 16384):
            with T.block("C_warp"):
                v0, v1 = T.axis.remap("SS", [ax0, ax1])
                C[v0, v1] = C_warp[v0 // 16, v1 // 16, ...]

Bottom-up hardware instruction selection:

Depth | Type         | Instructions
0     | Compute      | 2x mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
1     | Shared Load  | ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16
2     | Shared Store | st.shared.v4.u32
3     | Global Load  | ld.global.v4.u32
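
A quick check of my own on the deduced A_shared_warp index map above, (v2, v3) -> (v2 * 2 + v3 // 8, v3 % 8): it maps the 16x16 MMA tile one-to-one onto 32 threads x 8 register slots, so every element lands in exactly one warp register.

mapping = {(i * 2 + j // 8, j % 8) for i in range(16) for j in range(16)}
assert len(mapping) == 16 * 16 == 32 * 8
assert mapping == {(t, r) for t in range(32) for r in range(8)}
print("16x16 fp16 tile <-> 32 threads x 8 registers per thread")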
Advantages and Limitations
• Advantages: eliminates the search space for data layout in tensor scheduling, requiring only derivation.
• Limitations: requires pre-conversion of the data layout, which introduces conversion overhead.
Resolve the Limitation with Tile-Graph

Compute-intensive and memory-intensive operators are connected through registers; compute-intensive operators are connected to each other through shared memory. (OSDI ’23, Welder: high-performance operator fusion with tile-graph.)

Latency-hiding method based on the tile-graph:
• Constant folding for static weights: arrange the weights during the compilation phase to hide the latency.
• Forward propagation of data layout between operators: the preceding operator can process and write back data directly in the layout expected by the subsequent operator during execution, thereby avoiding additional data-layout conversion operations between the two operators.
Discussion: the performance impact of introducing layout transformation fusion.

Why do we need to introduce layout propagation? Consider an example compute flow with layout propagation: weight -> dequantize -> im2col -> shared-memory copies (Copy A / Copy B) -> MMA, where the layout deduced by Ladder is propagated back through the earlier stages.

Challenges:
1. The dimensions of the instructions and the computations do not align.
2. There are several peripheral computations (im2col, dequantize) outside the core MMA instructions.
3. Complex mapping relationships are introduced by nonlinear transformations (dequantize, group-wise scaling).

Im2col and dequantize transform the layout as well, so the deduced layout must be able to propagate across different compute blocks!
Methodology: three different layout propagation modes

Case 1: Linear transformation (transpose as an example).
Original layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
Propagates to:
lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)

Case 2: Compressed transformation (dequantize as an example; qweight 4 x INT4 -> weight 4 x FLOAT16).
Original layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
Propagates to:
lambda i, j: (i // 8 * 8 + j * 4 + i % 8 // 2, i % 2)
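
A small numpy check of my own for Case 1: the propagated map is simply the original map with its arguments swapped, and reading the stored tile through it reproduces the transpose.

import numpy as np

f = lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)  # original layout
g = lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)  # propagated layout

B = np.arange(256).reshape(16, 16)
storage = np.empty_like(B)
for i in range(16):
    for j in range(16):
        storage[f(i, j)] = B[i, j]          # B written through the deduced layout

BT = np.array([[storage[g(i, j)] for j in range(16)] for i in range(16)])
assert np.array_equal(BT, B.T)              # reading through g yields the transpose
print("g(i, j) == f(j, i): the layout propagates through the transpose")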
Methodology: three different layout propagation modes

Case 3: Non-injective transformation (dequantize with group-wise scaling as an example; one FP16 scale is shared by a whole group of quantized weights, so the per-element mapping is not injective).

Quantized weight (4 x INT4) layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)

This layout cannot be propagated through the group-wise scale.

BitBLAS implements auto layout propagation rules based on these three patterns.
Resolve the Conflict: Layout Auto-Differentiation

Layout conflict with correlated buffers: when the propagated layout is non-injective, buffers correlated with the transformed tensor (for example, the dequantization scale) conflict with it. Writing f' for the inverse transformation of a layout f, the scale must be re-indexed consistently: if B[j, k] = B'[f(j, k)], then S'[f'(j, k)] = S[j, k].

The conflict is resolved with common-subexpression elimination plus layout auto-differentiation, so that the dequant, SMEM copies (Copy A / Copy B) and MMA stages of the tTile graph all agree on the layout. In the conv example, this adds roughly 10% performance and makes the result comparable with TensorRT.
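
To make the inverse transformation concrete, here is a small sketch of my own: because the deduced index map is a bijection on the tile, its inverse can be tabulated numerically and used to re-index a correlated buffer consistently.

f = lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)

forward = {(i, j): f(i, j) for i in range(16) for j in range(16)}
f_inv = {v: k for k, v in forward.items()}            # total, since f is a bijection on 16x16
assert all(f_inv[f(i, j)] == (i, j) for i in range(16) for j in range(16))
print(f(3, 5), "->", f_inv[f(3, 5)])                  # (1, 13) -> (3, 5)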


Latency-Oriented Optimization Search Policy

The abstraction enlarges the scheduling space for DNN computation and opens a new trade-off between memory-footprint efficiency and latency efficiency.

When the system has sufficient storage, we additionally search over the latency overhead of performing type conversions at each stage and select the configuration with the shortest overall latency.
System Overview of Ladder/BitBLAS

Ladder is the system for end-to-end optimization; BitBLAS is the runtime kernel library.

Example usage:

matmul_config = bitblas.MatmulConfig(
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16", …
)
matmul = bitblas.Matmul(config=matmul_config)

Pipeline: auto tensorization, layout propagation, and hardware-aware tuning turn the operator into CUDA source and a compiled executable.

BitBLAS is now integrated into vLLM, AutoGPTQ, EfficientQAT, and HQQ! It has been used for 1.58-bit models and shipped to the Ads team for BitNet deployment!
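
For completeness, a fuller usage sketch in the spirit of the BitBLAS README (the M/N/K parameters, the transform_weight call and the torch tensors below are my assumptions from memory and may differ between BitBLAS versions; treat this as illustrative rather than canonical):

import torch
import bitblas

matmul_config = bitblas.MatmulConfig(
    M=1,                      # a single M, or a list of M values for dynamic-shape kernels
    N=1024,
    K=1024,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
)
matmul = bitblas.Matmul(config=matmul_config)

A = torch.rand((1, 1024), dtype=torch.float16, device="cuda")
W = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8, device="cuda")

Wq = matmul.transform_weight(W)   # pack/interleave the weight into the deduced layout
C = matmul(A, Wq)                 # run the tuned mixed-precision kernel
print(C.shape, C.dtype)           # torch.Size([1, 1024]) torch.float16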
Vectorized Dequantization with Weight Interleave (Tutorial: Fast Dequantize)

Conventional dequantization: a 32-bit word holds 8 x int4; each int4 is extracted with logic instructions into an int32 and then converted to float16 with a type-convert instruction, one value at a time.

Vectorized dequantization: logic instructions unpack the word into 8 x int32 (or 8 x int8) at once, which vectorized type-convert instructions turn into 8 x float16; however, this approach is hard to extend to even fewer bits.

Introducing this extra computation can become a performance bottleneck, especially at fewer bits and on weaker compute units (for example, the CUDA cores on an A100).

BitBLAS: chunk-level interleave, an extension of BitBLAS to support even fewer bits (1/2-bit to 8/16-bit). We also provide other fast dequantization paths, e.g. FP8 -> FP16.

Figure: fast decoding performance on an A100 GPU.
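
To spell out the conventional path above, here is a small numpy sketch of my own: each int4 is pulled out of the packed 32-bit word with shift/mask logic and converted to fp16 one value at a time (the vectorized/interleaved path replaces this loop with a handful of instructions producing 8 fp16 values per word).

import numpy as np

def dequant_int4_conventional(word_u32):
    vals = []
    for k in range(8):                                   # 8 x int4 packed in one 32-bit word
        nibble = (int(word_u32) >> (4 * k)) & 0xF        # logic instructions: shift + mask
        nibble = nibble - 16 if nibble >= 8 else nibble  # sign-extend the nibble
        vals.append(np.float16(nibble))                  # per-element type-convert instruction
    return np.array(vals, dtype=np.float16)

print(dequant_int4_conventional(np.uint32(0x8421F7A5)))
# [ 5. -6.  7. -1.  1.  2.  4. -8.]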
Fast and Efficient Dynamic Kernel Tuning

Building a universal library is challenging: different shapes (and architectures) prefer different kernel configurations (e.g. Kernel_256x256x32 vs. Kernel_16x64x32 vs. Kernel_32x128x32 for M=26, N=1024, K=1024).

Only one dimension (M) is dynamic within an LLM compute workload, so tuned kernels are stored in a kernel database and selected at runtime.

Tutorial: Fast Codegen
Tutorial: Dynamic Codegen
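
A sketch of my own of the kernel-database idea (the kernel names are taken from the figure; the M buckets are made up): N and K are fixed by the model weights, only M is dynamic, so a few pre-tuned kernels per (N, K) are selected by M at runtime.

KERNEL_DB = {
    (1024, 1024): [            # (upper bound on M, pre-tuned kernel config)
        (16,   "Kernel_16x64x32"),
        (64,   "Kernel_32x128x32"),
        (None, "Kernel_256x256x32"),   # catch-all for large prefill shapes
    ],
}

def select_kernel(m, n, k):
    for m_bound, kernel in KERNEL_DB[(n, k)]:
        if m_bound is None or m <= m_bound:
            return kernel

print(select_kernel(26, 1024, 1024))    # Kernel_32x128x32 for the decode-ish M=26 case
print(select_kernel(4096, 1024, 1024))  # Kernel_256x256x32 for a large prefill batch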
End-to-End Performance of Ladder (A100 80G)

• WFP16AFP16: ~1.1x / 1.1x average speedup over Welder / TensorRT.
• WINT4AFP16 (GPTQ): ~2.3x average speedup over vLLM WINT4AFP16.
• WINT1AINT8 (BitNet): up to 8.8x speedup over Ladder WFP16AFP16 (on BLOOM-176B, BS1/SEQ1).
Operator Performance of BitBLAS

System performance when scaling up:
• Decode is memory-intensive: quantized kernels benefit from reduced memory-bandwidth usage.
• Prefill is compute-intensive: quantized kernels benefit from more efficient hardware instructions.
Summary

• We proposed universal tensor abstractions and schedule primitives that let an ML compiler explore tensor scheduling.
• We proposed a hardware-aligned memory layout propagation strategy that automatically infers memory layouts and eliminates the conversion overhead.
• We proposed a bit-nearest, instruction-aligned tensorization strategy.
• We introduced a latency-oriented search policy.
• We designed Ladder and BitBLAS.

Challenges From The Community

Though BitBLAS has been integrated into vLLM, AutoGPTQ, and HQQ:
1. Kernel compilation takes too much time, even with the kernel database, and a runtime kernel library may lead to an uncomfortable user experience.
2. Schedule-based implementations make it hard for developers to extend BitBLAS.
3. Schedule-based implementations are hard to use for describing complex ops (like Stream-K or Flash Attention).

We are building TileLang to handle issues 2 and 3, since Triton is hard to use for describing dequantization-related operations.
Tutorials

More info and reproduction: https://github.com/microsoft/BitBLAS
More detail: the OSDI ’24 Ladder paper.

Thanks for watching!

Oct 26, 2024
