
GPU MODE Community

BitBLAS: Enabling Efficient Low-Precision Deep Learning Computing

Lei Wang (/leɪ wɑːŋ/)

leiwang1999@outlook.com
Oct 26, 2024
Outline

Background: Mixed-Precision Computing

Introduction: Design of BitBLAS/Ladder

Experiments (End2End/OP): NVIDIA/AMD

Tutorials in Jupyter: BitBLAS / Ladder / Tile Language


Larger Scale, Fewer Bits

Models keep growing: LLAMA-65B, LLAMA-2-70B, LLAMA-3-400B, Stable Diffusion.
Precisions keep shrinking: FP32, FP16, FP8, MXFP, INT4, FP4, INT1, ...

Conventional quantization: recent research has pushed the boundaries of low-bit! LLAMA-2-7B with FP16 precision requires at least 14 GB of memory just to host the model.

FP16 checkpoint sizes:
LLAMA-7B: 13 GB
LLAMA-13B: 37 GB
LLAMA-30B: 76 GB
LLAMA-65B: 122 GB

Representative methods, from 8 bits down to 1 bit: SmoothQuant (8-bit), AutoGPTQ (4-bit), BitDistiller*, BitNet-1.58bits*, BitNet* and OneBit (1-bit).

*represents research from MSRA
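
As a quick sanity check on these numbers, a back-of-the-envelope Python sketch of my own (counting weight storage only, in GiB):

def weight_gib(n_params, bits):
    # parameters x bits / 8 = bytes, then convert to GiB
    return n_params * bits / 8 / 1024**3

print(f"LLAMA-2-7B @ FP16: {weight_gib(7e9, 16):.1f} GiB")   # ~13 GiB, i.e. ~14 GB to host
print(f"LLAMA-2-7B @ INT4: {weight_gib(7e9, 4):.1f} GiB")    # ~3.3 GiB
print(f"LLAMA-65B  @ FP16: {weight_gib(65e9, 16):.1f} GiB")  # ~121 GiB, matching the 122 GB above
print(f"LLAMA-65B  @ INT1: {weight_gib(65e9, 1):.1f} GiB")   # ~7.6 GiB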
Challenges

Three major challenges:

Unsupported numerical precision in software
New data types such as NF4/AF4/MXFP have emerged.

Unsupported compute instructions in hardware
Most hardware doesn't have an FP16xINT4 compute unit.

Combinatorial explosion that is hard to optimize
Even though vendors and developers have given it attention.
Support from vendor libraries and MLC.

Hardware evolution of lower-precision computing.


Insights

Mixed-Precision GEMM Execution Flow

Key Observation 1: the memory system has compatibility. How?
Example: C[M, N]@FP16 = A[M, K]@FP16 x B[N, K]@FP8, with M = 2, N = 2, K = 4.

An opaque, fixed-width 8-bit storage block can be reinterpreted into arbitrary datatypes: 2 x int4, 2 x nf4, or 1 x int8.

The memory system can store any data type by converting these custom data types into fixed-width opaque data blocks.
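
As a small illustration of this observation (a numpy sketch of my own, not BitBLAS code), the same 8-bit block can be produced from two signed int4 values and read back either as 2 x int4 or as 1 x int8:

import numpy as np

def pack_int4x2(lo, hi):
    # pack two signed int4 values (-8..7) into one opaque uint8 storage block
    return np.uint8(((hi & 0xF) << 4) | (lo & 0xF))

def unpack_int4x2(block):
    # reinterpret the same block as two signed int4 values (sign-extend each nibble)
    to_signed = lambda n: n - 16 if n >= 8 else n
    return to_signed(int(block) & 0xF), to_signed(int(block) >> 4)

block = pack_int4x2(-3, 7)
print(hex(int(block)))                     # 0x7d: the storage itself is just 8 opaque bits
print(unpack_int4x2(block))                # (-3, 7)  when read as 2 x int4
print(np.array([block]).view(np.int8)[0])  # 125      when read as 1 x int8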

Key Observation 2: the compute instructions have compatibility. Which ones can we leverage?

INT4 is losslessly compatible with FP16 and FP32, and FP16 is losslessly compatible with FP32.

Most custom data types can be losslessly converted into wider standard data types supported by existing hardware computing units for processing.
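
A quick numpy check of this lossless compatibility (my own illustration): every int4/int8 value survives a round trip through fp16 unchanged, while wider integers would not.

import numpy as np

int4_vals = np.arange(-8, 8)
int8_vals = np.arange(-128, 128)
assert np.all(int4_vals.astype(np.float16).astype(np.int64) == int4_vals)  # int4 -> fp16 is lossless
assert np.all(int8_vals.astype(np.float16).astype(np.int64) == int8_vals)  # int8 -> fp16 is lossless

# fp16 has an 11-bit significand, so integers above 2048 start to round:
print(np.float16(2049))  # 2048.0 -> int32 is NOT losslessly compatible with fp16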
Separate Datatype and Computing with Machine Learning Compilation

Conventional MLC stack:
• A Python-like DSL
• Intermediate representation (TensorIR, MLIR, ...)
• Transformations (loop unroll, ...)
• Backend: generate code or executables for different hardware

Just as ML compilation separates compute from schedule, can we separate the activation/weight datatype (FP16, FP32, FP8, INT8, INT4, NF4, ...) from the hardware backend (Ampere, Volta, RDNA, CDNA, ...)?

We need a universal type representation to hide the conversion and do efficient codegen. However, the performance of current machine learning compilation is still unsatisfactory, even under hardware-supported instructions.
Existing compilation systems fail to fully utilize the performance of computing units.

Figure: MatMul performance of MLC on an RTX 3090 (Tensor Core). AMOS and TensorIR can only reach 60-80% of cuBLAS performance.

Simple memory accesses struggle to meet the demands of the various storage levels simultaneously:
• GMEM: expects coalesced access
• SMEM: expects bank-conflict-free access
• REG: must align with the instruction

Reference: A Swizzling Rule for 8-Bit Tensor Cores (NVIDIA GTC 2020)
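
To make the SMEM point concrete, here is a tiny illustrative sketch of my own (simplified to fp32 elements and a padded row, rather than the fp16 swizzling the GTC talk describes): 32 banks of 4 bytes, one warp reading a column of a row-major tile.

def worst_case_conflict(n_cols):
    # thread t of a warp reads element [t][0] of a row-major float32 tile;
    # shared memory has 32 banks, each 4 bytes wide
    banks = [((t * n_cols) * 4 // 4) % 32 for t in range(32)]
    return max(banks.count(b) for b in set(banks))

print(worst_case_conflict(32))  # 32: every thread hits the same bank (32-way conflict)
print(worst_case_conflict(33))  # 1: padding the row by one element spreads the accesses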
Major Factors for Performance

1) Efficient tiling: control the compute-to-memory ratio, cache usage and size, and register size. Existing MLC primitives can handle this.

2) Bandwidth utilization: a better memory access pattern. Existing MLC primitives cannot handle this, and it is hard to get the swizzling rule right (see Swizzle Inventor, ASPLOS 2021; Graphene, ASPLOS 2023).

Insight: the abstraction needs to be aware of and able to manipulate the data layout of tensors!
Tensor-Centric System Abstractions

Four tTile schedule primitives:
• tTile slice(tTile, index, shape, output_shape)
• tTile Pad(tTile, pad_shape, pad_value)
• tTile Convert(tTile, scope, c_func)
• tTile TransformLayout(tTile, scope, index_map)

Example: using tTile to build a mixed-precision computing expression (INT4 -> FLOAT16), and an example of a scheduled execution plan built with the tTile schedule primitives on NVIDIA GPUs.
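
To give a feel for how these primitives compose, here is a toy, self-contained Python model of the four primitives operating on numpy arrays (illustrative only; the real Ladder abstraction works on compiler IR and generates code, it does not execute eagerly like this):

from dataclasses import dataclass
import numpy as np

@dataclass
class TTile:
    data: np.ndarray
    scope: str = "global"

def t_slice(t, index, shape, output_shape):
    # slice(tTile, index, shape, output_shape): cut out a sub-tile
    sl = tuple(slice(i, i + s) for i, s in zip(index, shape))
    return TTile(t.data[sl].reshape(output_shape), t.scope)

def t_pad(t, pad_shape, pad_value):
    # Pad(tTile, pad_shape, pad_value): pad up to an instruction-aligned shape
    pads = [(0, p - s) for s, p in zip(t.data.shape, pad_shape)]
    return TTile(np.pad(t.data, pads, constant_values=pad_value), t.scope)

def t_convert(t, scope, c_func):
    # Convert(tTile, scope, c_func): element-wise datatype conversion in a given scope
    return TTile(c_func(t.data), scope)

def t_transform_layout(t, scope, index_map):
    # TransformLayout(tTile, scope, index_map): rearrange elements by an index map
    out = np.empty_like(t.data)
    for idx in np.ndindex(*t.data.shape):
        out[index_map(*idx)] = t.data[idx]
    return TTile(out, scope)

# INT4 -> FLOAT16 style flow: pad a quantized weight tile to the MMA shape, move it
# to shared scope (a real schedule would pass a hardware-aligned index map here),
# convert it to fp16, and slice out one 16x16 MMA tile.
w_int4 = TTile(np.random.randint(-8, 8, size=(14, 30)).astype(np.int8))
w_pad  = t_pad(w_int4, pad_shape=(16, 32), pad_value=0)
w_smem = t_transform_layout(w_pad, "shared", lambda i, j: (i, j))
w_fp16 = t_convert(w_smem, "register", lambda x: x.astype(np.float16))
mma_in = t_slice(w_fp16, index=(0, 0), shape=(16, 16), output_shape=(16, 16))
print(mma_in.data.dtype, mma_in.data.shape, mma_in.scope)  # float16 (16, 16) register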
New Design Space

Example of our tTile-Graph abstraction for end-to-end optimization of LLAMA, enabling more fine-grained control across operators and even across different memory layers.

These abstractions enlarge the scheduling space for DNN computation! For more detail, see the OSDI ’24 Ladder paper.
Auto Normalize Computation into Hardware Instructions

Bit-nearest instruction matching: match the instruction to convert to based on the instruction's computation pattern and throughput. For example, an FP32 FMA expression over an INT4 operand can be converted into HFMA2 FP16 or HMMA FP16 instructions.

Device   | Inst              | Data Type | Throughput  | Expression
RTX 3090 | DFMA              | FLOAT64   | 8.9 TFLOPS  | D[0] = A[0] * B[0] + C[0]
RTX 3090 | FMA               | FLOAT32   | 35.6 TFLOPS | D[0] = A[0] * B[0] + C[0]
RTX 3090 | IMAD              | INT32     | 17.8 TOPS   | D[0] = A[0] * B[0] + C[0]
RTX 3090 | HFMA2             | FLOAT16   | 35.6 TFLOPS | D[0:2] = A[0:2] * B[0:2] + C[0:2]
RTX 3090 | DP4A              | INT8      | 71.2 TOPS   | D[0] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | HMMA.m16n8k16.f16 | FLOAT16   | 142 TFLOPS  | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]
RTX 3090 | IMMA.m16n8k32.s8  | INT8      | 284 TOPS    | D[0:16, 0:16] = dot(A[0:4], B[0:4]) + C[0]

Iterator-based auto expression normalization: for example, normalizing conv2d into a Tensor Core instruction (Tutorial: Auto Tensorize). This lets us explore whether a given customized op (conv, stencil, ...) can be tensorized by the target instruction.
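
A simplified sketch of my own of the bit-nearest idea (not BitBLAS source, and the lossless-compatibility table here is an assumption): among the instructions above whose compute type can losslessly hold both operand types, pick the one with the highest throughput.

INSTS = [                      # (name, compute dtype, throughput) from the RTX 3090 table above
    ("DFMA", "float64", 8.9), ("FMA", "float32", 35.6), ("IMAD", "int32", 17.8),
    ("HFMA2", "float16", 35.6), ("DP4A", "int8", 71.2),
    ("HMMA.m16n8k16.f16", "float16", 142.0), ("IMMA.m16n8k32.s8", "int8", 284.0),
]

LOSSLESS = {                   # which compute dtypes can losslessly hold a storage dtype
    "int4":    {"int8", "int32", "float16", "float32", "float64"},
    "int8":    {"int8", "int32", "float16", "float32", "float64"},
    "float16": {"float16", "float32", "float64"},
    "nf4":     {"float16", "float32", "float64"},   # table-lookup type, floats only
}

def bit_nearest(a_dtype, w_dtype):
    compatible = LOSSLESS[a_dtype] & LOSSLESS[w_dtype]
    return max((tput, name) for name, ctype, tput in INSTS if ctype in compatible)

print(bit_nearest("float16", "int4"))  # (142.0, 'HMMA.m16n8k16.f16')
print(bit_nearest("int8", "int4"))     # (284.0, 'IMMA.m16n8k32.s8')
print(bit_nearest("float16", "nf4"))   # (142.0, 'HMMA.m16n8k16.f16')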
Hardware Aligned Layout Propagation

Tile-based program memory access: an MMA tile nested inside a warp tile. The search space for data layouts is vast, with possible combinations on the order of O(N!); it is impossible to traverse all of them.

Optimal layout deduction instead relies on tDevice, a hardware abstraction that:
• Explicitly defines the preferred access pattern for the different memory layers.
• Explicitly defines the access pattern of the instructions at the warp level.

From these, the perfect, hardware-aligned access pattern is deduced. Figure: for ldmatrix, (a) the 8 fp16 shared-memory values accessed per thread, and (b) the 8 values each thread (T0-T31) receives in its registers.
Hardware Aligned Layout Propagation

Hardware-aligned layout deduction: define the computation with a DSL (TIR), specify a tDevice, and deduce.

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")
        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float16(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk].astype("float16") * B[vj, vk].astype("float16")

The deduction produces a memory-intensive operator that re-lays out the input:

B[vi // 16, vj // 16, vi % 16, vj % 16] = A[vi // 8 * 8 + vi % 4 * 2 + vj % 16 // 8, vj // 16 * 16 + vi % 8 // 4 * 8 + vj % 8]

and a compute-intensive op with perfect layout access:

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer(), B: T.Buffer(), C: T.Buffer()):
        __fetch2shared()
        for ax0, ax1, ax2, ax3 in T.grid(1024, 1024, 16, 16):
            with T.block("A_shared_warp"):
                v0, v1, v2, v3 = T.axis.remap("SSSS", [ax0, ax1, ax2, ax3])
                A_shared_warp[v0, v1, v2 * 2 + v3 // 8, v3 % 8] = A_shared[v0, v1, v2, v3]
        for ax0, ax1, ax2, ax3 in T.grid(1024, 1024, 16, 16):
            with T.block("B_shared_warp"):
                v0, v1, v2, v3 = T.axis.remap("SSSS", [ax0, ax1, ax2, ax3])
                B_shared_warp[v0, v1, v2 * 2 + v3 // 8, v3 % 8] = B_shared[v0, v1, v2, v3]
        for ii, jj, kk, i, j, k in T.grid(1024, 1024, 1024, 16, 16, 16):
            with T.block("B"):
                vii, vjj, vkk, vi, vj, vk = T.axis.remap("SSRSSR", [ii, jj, kk, i, j, k])
                with T.init():
                    C_warp[vii, vjj, vi % 8 * 4 + vj % 8 // 2, vj // 8 * 4 + vi // 8 * 2 + vj % 2] = T.float16(0)
                C_warp[vii, vjj, vi % 8 * 4 + vj % 8 // 2, vj // 8 * 4 + vi // 8 * 2 + vj % 2] += \
                    A_shared_warp[vii, vkk, vi * 2 + vk // 8, vk % 8] * B_shared_warp[vjj, vkk, vj * 2 + vk // 8, vk % 8]
        for ax0, ax1 in T.grid(16384, 16384):
            with T.block("C_warp"):
                v0, v1 = T.axis.remap("SS", [ax0, ax1])
                C[v0, v1] = C_warp[v0 // 16, v1 // 16, ...]

Bottom-up hardware instruction selection:

Depth | Type         | Instructions
0     | Compute      | 2x mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
1     | Shared Load  | ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16
2     | Shared Store | st.shared.v4.u32
3     | Global Load  | ld.global.v4.u32
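
A quick check of my own on the deduced A_shared_warp index map above, (v2, v3) -> (v2 * 2 + v3 // 8, v3 % 8): it maps the 16x16 MMA tile one-to-one onto 32 threads x 8 register slots, so every element lands in exactly one warp register.

mapping = {(i * 2 + j // 8, j % 8) for i in range(16) for j in range(16)}
assert len(mapping) == 16 * 16 == 32 * 8
assert mapping == {(t, r) for t in range(32) for r in range(8)}
print("16x16 fp16 tile <-> 32 threads x 8 registers per thread")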
Advantages and Limitations
• Advantages: eliminates the search space for data layout in tensor scheduling, requiring only derivation.
• Limitations: requires pre-conversion of the data layout, which introduces conversion overhead.
Resolve the Limitation with Tile-Graph

Compute-intensive and memory-intensive operators are connected through registers; compute-intensive operators are connected to each other through shared memory. (OSDI ’23, Welder: high-performance operator fusion with tile-graph.)

Latency-hiding method based on the tile-graph:
• Constant folding for static weights: arrange the weights during the compilation phase to hide the latency.
• Forward propagation of data layout between operators: the preceding operator can process and write back data directly in the layout expected by the subsequent operator during execution, thereby avoiding additional data-layout conversion operations between the two operators.
Discussion: the performance impact of introducing layout transformation fusion.

Why do we need to introduce layout propagation? Consider an example compute flow with layout propagation: weight -> dequantize -> im2col -> shared-memory copies (Copy A / Copy B) -> MMA, where the layout deduced by Ladder is propagated back through the earlier stages.

Challenges:
1. The dimensions of the instructions and the computations do not align.
2. There are several peripheral computations (im2col, dequantize) outside the core MMA instructions.
3. Complex mapping relationships are introduced by nonlinear transformations (dequantize, group-wise scaling).

Im2col and dequantize transform the layout as well, so the deduced layout must be able to propagate across different compute blocks!
Methodology: three different layout propagation modes

Case 1: Linear transformation (transpose as an example).
Original layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
Propagates to:
lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)

Case 2: Compressed transformation (dequantize as an example; qweight 4 x INT4 -> weight 4 x FLOAT16).
Original layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)
Propagates to:
lambda i, j: (i // 8 * 8 + j * 4 + i % 8 // 2, i % 2)
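
A small numpy check of my own for Case 1: the propagated map is simply the original map with its arguments swapped, and reading the stored tile through it reproduces the transpose.

import numpy as np

f = lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)  # original layout
g = lambda i, j: (j // 8 * 8 + i // 8 * 4 + j % 8 // 2, j % 2 * 8 + i % 8)  # propagated layout

B = np.arange(256).reshape(16, 16)
storage = np.empty_like(B)
for i in range(16):
    for j in range(16):
        storage[f(i, j)] = B[i, j]          # B written through the deduced layout

BT = np.array([[storage[g(i, j)] for j in range(16)] for i in range(16)])
assert np.array_equal(BT, B.T)              # reading through g yields the transpose
print("g(i, j) == f(j, i): the layout propagates through the transpose")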
Methodology: three different layout propagation modes

Case 3: Non-injective transformation (dequantize with group-wise scaling as an example; one FP16 scale is shared by a whole group of quantized weights, so the per-element mapping is not injective).

Quantized weight (4 x INT4) layout:
lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)

This layout cannot be propagated through the group-wise scale.

BitBLAS implements auto layout propagation rules based on these three patterns.
Resolve the Conflict: Layout Auto-Differentiation

Layout conflict with correlated buffers: when the propagated layout is non-injective, buffers correlated with the transformed tensor (for example, the dequantization scale) conflict with it. Writing f' for the inverse transformation of a layout f, the scale must be re-indexed consistently: if B[j, k] = B'[f(j, k)], then S'[f'(j, k)] = S[j, k].

The conflict is resolved with common-subexpression elimination plus layout auto-differentiation, so that the dequant, SMEM copies (Copy A / Copy B) and MMA stages of the tTile graph all agree on the layout. In the conv example, this adds roughly 10% performance and makes the result comparable with TensorRT.
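
To make the inverse transformation concrete, here is a small sketch of my own: because the deduced index map is a bijection on the tile, its inverse can be tabulated numerically and used to re-index a correlated buffer consistently.

f = lambda i, j: (i // 8 * 8 + j // 8 * 4 + i % 8 // 2, i % 2 * 8 + j % 8)

forward = {(i, j): f(i, j) for i in range(16) for j in range(16)}
f_inv = {v: k for k, v in forward.items()}            # total, since f is a bijection on 16x16
assert all(f_inv[f(i, j)] == (i, j) for i in range(16) for j in range(16))
print(f(3, 5), "->", f_inv[f(3, 5)])                  # (1, 13) -> (3, 5)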


Latency-Oriented Optimization Search Policy

The abstraction enlarges the scheduling space for DNN computation and opens a new trade-off between memory-footprint efficiency and latency efficiency.

When the system has sufficient storage, we additionally search over the latency overhead of performing type conversions at each stage and select the configuration with the shortest overall latency.
System Overview of Ladder/BitBLAS

Ladder is the system for end-to-end optimization; BitBLAS is the runtime kernel library.

Example usage:

matmul_config = bitblas.MatmulConfig(
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16", …
)
matmul = bitblas.Matmul(config=matmul_config)

Pipeline: auto tensorization, layout propagation, and hardware-aware tuning turn the operator into CUDA source and a compiled executable.

BitBLAS is now integrated into vLLM, AutoGPTQ, EfficientQAT, and HQQ! It has been used for 1.58-bit models and shipped to the Ads team for BitNet deployment!
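
For completeness, a fuller usage sketch in the spirit of the BitBLAS README (the M/N/K parameters, the transform_weight call and the torch tensors below are my assumptions from memory and may differ between BitBLAS versions; treat this as illustrative rather than canonical):

import torch
import bitblas

matmul_config = bitblas.MatmulConfig(
    M=1,                      # a single M, or a list of M values for dynamic-shape kernels
    N=1024,
    K=1024,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
)
matmul = bitblas.Matmul(config=matmul_config)

A = torch.rand((1, 1024), dtype=torch.float16, device="cuda")
W = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8, device="cuda")

Wq = matmul.transform_weight(W)   # pack/interleave the weight into the deduced layout
C = matmul(A, Wq)                 # run the tuned mixed-precision kernel
print(C.shape, C.dtype)           # torch.Size([1, 1024]) torch.float16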
Vectorized Dequantization with Weight Interleave (Tutorial: Fast Dequantize)

Conventional dequantization: a 32-bit word holds 8 x int4; each int4 is extracted with logic instructions into an int32 and then converted to float16 with a type-convert instruction, one value at a time.

Vectorized dequantization: logic instructions unpack the word into 8 x int32 (or 8 x int8) at once, which vectorized type-convert instructions turn into 8 x float16; however, this approach is hard to extend to even fewer bits.

Introducing this extra computation can become a performance bottleneck, especially at fewer bits and on weaker compute units (for example, the CUDA cores on an A100).

BitBLAS: chunk-level interleave, an extension of BitBLAS to support even fewer bits (1/2-bit to 8/16-bit). We also provide other fast dequantization paths, e.g. FP8 -> FP16.

Figure: fast decoding performance on an A100 GPU.
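
To spell out the conventional path above, here is a small numpy sketch of my own: each int4 is pulled out of the packed 32-bit word with shift/mask logic and converted to fp16 one value at a time (the vectorized/interleaved path replaces this loop with a handful of instructions producing 8 fp16 values per word).

import numpy as np

def dequant_int4_conventional(word_u32):
    vals = []
    for k in range(8):                                   # 8 x int4 packed in one 32-bit word
        nibble = (int(word_u32) >> (4 * k)) & 0xF        # logic instructions: shift + mask
        nibble = nibble - 16 if nibble >= 8 else nibble  # sign-extend the nibble
        vals.append(np.float16(nibble))                  # per-element type-convert instruction
    return np.array(vals, dtype=np.float16)

print(dequant_int4_conventional(np.uint32(0x8421F7A5)))
# [ 5. -6.  7. -1.  1.  2.  4. -8.]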
Fast and Efficient Dynamic Kernel Tuning

Building a universal library is challenging: different shapes (and architectures) prefer different kernel configurations (e.g. Kernel_256x256x32 vs. Kernel_16x64x32 vs. Kernel_32x128x32 for M=26, N=1024, K=1024).

Only one dimension (M) is dynamic within an LLM compute workload, so tuned kernels are stored in a kernel database and selected at runtime.

Tutorial: Fast Codegen
Tutorial: Dynamic Codegen
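
A sketch of my own of the kernel-database idea (the kernel names are taken from the figure; the M buckets are made up): N and K are fixed by the model weights, only M is dynamic, so a few pre-tuned kernels per (N, K) are selected by M at runtime.

KERNEL_DB = {
    (1024, 1024): [            # (upper bound on M, pre-tuned kernel config)
        (16,   "Kernel_16x64x32"),
        (64,   "Kernel_32x128x32"),
        (None, "Kernel_256x256x32"),   # catch-all for large prefill shapes
    ],
}

def select_kernel(m, n, k):
    for m_bound, kernel in KERNEL_DB[(n, k)]:
        if m_bound is None or m <= m_bound:
            return kernel

print(select_kernel(26, 1024, 1024))    # Kernel_32x128x32 for the decode-ish M=26 case
print(select_kernel(4096, 1024, 1024))  # Kernel_256x256x32 for a large prefill batch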
End-to-End Performance of Ladder (A100 80G)

• WFP16AFP16: ~1.1x / 1.1x average speedup over Welder / TensorRT.
• WINT4AFP16 (GPTQ): ~2.3x average speedup over vLLM WINT4AFP16.
• WINT1AINT8 (BitNet): up to 8.8x speedup over Ladder WFP16AFP16 (on BLOOM-176B, BS1/SEQ1).
Operator Performance of BitBLAS

System performance when scaling up:
• Decode is memory-intensive: quantized kernels benefit from reduced memory-bandwidth usage.
• Prefill is compute-intensive: quantized kernels benefit from more efficient hardware instructions.
Summary

• We proposed universal tensor abstractions and schedule primitives that let an ML compiler explore tensor scheduling.
• We proposed a hardware-aligned memory layout propagation strategy that automatically infers memory layouts and eliminates the conversion overhead.
• We proposed a bit-nearest, instruction-aligned tensorization strategy.
• We introduced a latency-oriented search policy.
• We designed Ladder and BitBLAS.

Challenges From The Community

Though BitBLAS has been integrated into vLLM, AutoGPTQ, and HQQ:
1. Kernel compilation takes too much time, even with the kernel database, and a runtime kernel library may lead to an uncomfortable user experience.
2. Schedule-based implementations make it hard for developers to extend BitBLAS.
3. Schedule-based implementations are hard to use for describing complex ops (like Stream-K or Flash Attention).

We are building TileLang to handle issues 2 and 3, since Triton is hard to use for describing dequantization-related operations.
Tutorials

More info and reproduction: https://github.com/microsoft/BitBLAS
More detail: the OSDI ’24 Ladder paper.

Thanks for watching!

Oct 26, 2024
