
May 27, 2025 · 13 min read

Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B
Benjamin Spector*, Jordan Juravsky*, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu,
Simran Arora, Chris Ré

There are some applications that benefit from running LLMs really, really fast. This
low-latency regime encompasses applications like chatbots and human-in-the-
loop workflows, where users care a lot about seeing responses come back
immediately.

Given the importance of these low-latency workloads, we wanted to explore just how fast we can run open-source models on modern GPUs. To really stress-test
existing systems, we consider an aggressive low-latency scenario where we
generate a single sequence with Llama-3.2-1B. This workload is strongly memory
bound – our performance is dominated by how fast we can load model weights
from GPU global memory.

It turns out that popular LLM inference engines – vLLM and SGLang – are only
able to use at most 50% of available GPU bandwidth when running this workload
on an H100. The root of the problem, which we'll describe more below, is that
existing systems break down a model forward pass into around a hundred
separate kernels that each implement a few operations (e.g. RMS norm,
attention, an MLP layer + activation, rotary). Each kernel comes with a setup and
teardown period and during this time no useful work gets done – for instance, the
all-important task of loading model weights is stalled.


Figure 1: Speed! Results generated with a 32-token prompt and 128 generated
tokens, with no speculation

In this post, we show how we can bypass this problem by merging the entire
Llama-1B forward pass into a single "megakernel" that eliminates kernel
boundaries altogether. Doing this achieves brr – on an H100, we use 78% of
memory bandwidth and outperform existing systems by over 1.5x. (To our
knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In
the rest of this post, we'll walk through how and why one would do this.
Specifically:

– First, we'll talk about how small kernels lead to AI systems that underutilize the
GPU's full bandwidth.
– Second, we'll describe three important points about how we built our
megakernel: how we fused lots of kernels together, how we share hardware
resources across them to minimize overhead, and how we synchronize them
efficiently.


If you're interested in learning more of the details or using these ideas yourself,
we're open-sourcing all of our code here.

Separate Kernels Kill the Vibe


In general, the way one runs code on a GPU is by launching a "kernel" – a small
program that does a well-defined operation (e.g. RMS norm, MLP). Today, all AI
workloads run as long sequences of relatively small kernels. To get an initial sense,
let's look at the operations in the Llama-1B transformer block, and some example
kernel boundaries of how they might be divided up (Figure 2).

Figure 2: An example set of kernel boundaries for the Llama-1B transformer block.
Red boxes delineate the work done by individual kernels.
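
For readers who haven't written CUDA before, here is roughly what one of these small kernels looks like: a deliberately naive RMS norm, written purely for illustration (it is not code from any of the systems discussed in this post).

```cuda
// Naive, illustrative RMS norm kernel: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
// One thread block normalizes a single d-dimensional vector.
__global__ void rms_norm(const float* x, const float* weight, float* out, int d, float eps) {
    __shared__ float sum_sq;
    if (threadIdx.x == 0) sum_sq = 0.f;
    __syncthreads();

    float local = 0.f;
    for (int i = threadIdx.x; i < d; i += blockDim.x) local += x[i] * x[i];
    atomicAdd(&sum_sq, local);   // crude block-wide reduction
    __syncthreads();

    float scale = rsqrtf(sum_sq / d + eps);
    for (int i = threadIdx.x; i < d; i += blockDim.x) out[i] = x[i] * scale * weight[i];
}

// Host side, e.g.: rms_norm<<<1, 256>>>(d_x, d_weight, d_out, 2048, 1e-5f);
```

An inference engine launches on the order of a hundred kernels like this, one after another, for every forward pass.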

As we described earlier, decoding a single sequence with Llama-1B is a purely memory-bound workload: our performance depends on being able to always be
loading weights from GPU global memory. So, why are existing approaches so far
from using the full bandwidth of the GPU?

When we dug into it, we noticed a key problem was that the current kernel-
based approach to running models introduces stalls that prevent us from
constantly loading memory:

– First: GPU kernels are launched with a strict ordering, so that a thread block in one kernel can't start until all thread blocks in previous kernels have completely finished. Consequently, every time we start a kernel, we have to wait for all the straggler thread blocks from the prior one to finish. For example, if a kernel runs 512 thread blocks (like our Llama-1B down projection), but we only have 148 streaming multiprocessors (like on a B200), we end up with 80 empty SMs at the end.

– Second, as we've previously highlighted, each kernel launch and teardown incurs costs. In principle, NVIDIA's CUDA graphs can help hide costs, but by our measurements they still leave a lot on the table. For a simple dummy kernel (which dumps a start time, sleeps, and dumps an end time) on an H100, we find that running on a CUDA stream incurs a launch cost of about 2.1 microseconds, and with CUDA graphs the launch cost only decreases to around 1.3 microseconds – time spent with the GPU doing no useful work! We'd like to have the GPU spend all of its time doing useful work.
– Finally, even after we start the next kernel, we still have to wait to load weights and activations before any compute can start. These latencies leave the GPU sitting idle for thousands of cycles! Ideally, we'd start loading the next weights while the previous computations and stores are happening. NVIDIA has also built a mechanism for this called Programmatic Dependent Launch (PDL), which allows the next kernel to start preparing while the previous kernel is running (see the sketch just after this list), but we found it still introduces unnecessary stalls because the PDL synchronization mechanism (cudaGridDependencySynchronize) is very coarse. For example, it means we have to wait for all queries, keys, and values to complete in order to start attention, as opposed to starting heads as soon as they are ready. We'll later show another specific case in Llama-1B where this kind of finer-grained synchronization is useful.
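
For reference, here is roughly what PDL looks like, sketched from NVIDIA's public API rather than taken from our megakernel or from any inference engine. The single cudaGridDependencySynchronize() call is the coarse barrier we mean: the consumer cannot read any of the producer's output until all of it is ready.

```cuda
#include <cuda_runtime.h>

__global__ void producer(float* out) {
    // ... compute and store results into out ...
    // Signal that this block has finished producing data the dependent kernel needs.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer(const float* in, float* out) {
    // Prologue work that doesn't touch `in` (e.g. prefetching weights) can overlap
    // with the producer. But this one coarse barrier waits on the *entire* producer,
    // so we can't start an attention head as soon as just its Q/K/V are ready.
    cudaGridDependencySynchronize();
    // ... now it is safe to read `in` ...
}

void launch_with_pdl(cudaStream_t stream, float* buf, float* result) {
    producer<<<148, 256, 0, stream>>>(buf);   // arbitrary illustrative launch shape

    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(148);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    // The consumer may begin launching while the producer is still running.
    cudaLaunchKernelEx(&cfg, consumer, (const float*)buf, result);
}
```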

Taken together, these form the "memory pipeline bubbles" our title references –
and they represent a key reason that we're not always loading from memory.
For short operations, these pauses add up, wasting a huge chunk of potential
bandwidth. In part, this is because Llama-1B (actually 1.24B parameters) in batch
size 1 is just so... small: if each operation is really fast, then the time spent in-
between them really starts to matter.


To illustrate the magnitude of the problem: for single-sequence generation in 16-bit precision on a single H100, the memory limit is 3.35TB/s / 2.48GB = ~1350
forward passes per second. But with 7 kernel launches per layer, and 16 layers,
even with an optimistic 5 us of stalling per kernel (counting stragglers, kernel
launch, and memory latencies), generation would run at just ~770 forward passes
per second. In practice, it's often worse. On low-latency workloads, GPUs spend
only a fraction of their time actually doing any useful work!
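
For anyone who wants to check those numbers, here is the same back-of-envelope model written out as a tiny program (the 3.35TB/s, 2.48GB, 7-kernels-per-layer, and 5-microsecond figures are the ones quoted above):

```cuda
#include <cstdio>

// Back-of-envelope model of how per-kernel stalls eat into the memory-bound limit.
int main() {
    const double bw_bytes_per_s = 3.35e12;   // H100 memory bandwidth
    const double weight_bytes   = 2.48e9;    // Llama-1B weights in 16-bit precision
    const double ideal_fps = bw_bytes_per_s / weight_bytes;       // ~1350 forward passes/s

    const int    kernels = 7 * 16;           // 7 kernel launches per layer, 16 layers
    const double stall   = 5e-6;             // optimistic 5 us of bubbles per kernel
    const double per_pass = 1.0 / ideal_fps + kernels * stall;    // ~740 us + ~560 us
    std::printf("ideal: ~%.0f fwd/s, with bubbles: ~%.0f fwd/s\n", ideal_fps, 1.0 / per_pass);
}
```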

So while CUDA does provide some existing features (e.g. graphs, streams, PDL) to
partially solve these problems, we wanted to see if a different approach could
solve all of these problems, where we just fuse the entire model forward pass into
a single kernel.

How to Megakernel
Next, we'll show you how we fused a whole Llama forward pass into a single
kernel, and our methods for resolving three key problems:

1. Fusing dozens of operations is hard to do from scratch. We need a mechanism for executing these operations within the megakernel.
2. In order to overlap multiple operations on the same hardware, we need to
prevent contention over limited resources, such as shared memory.
3. The GPU synchronizes after each kernel in the traditional kernel model.
Without kernels, we have to synchronize the GPU all by ourselves!

Let's start with the first issue:

Issue 1/3: Fusing Lots of Operations


Traditional kernel fusion generally merges just two or three operations together. In
contrast, we need to fuse about a hundred. Consequently, we need to have a
sensible abstraction for how we can actually program a megakernel.


Our approach is built on an on-GPU interpreter – essentially a more sophisticated version of our infrastructure underlying ThunderMLA. Our
interpreter is designed such that each streaming multiprocessor (SM) within the
GPU receives a sequence of instructions (each implemented using the same
CUDA template) and executes them. We schedule each SM's instruction
sequence ahead of time on the Python side, and notably we can reuse each
schedule for hundreds of forward passes!

For our end-to-end Llama forward pass megakernel, we define the following set
of instructions:

– A fused RMS norm & QKV & RoPE instruction.

– An attention computation instruction.

– An attention reduction instruction (for ThunderGQA on long sequences).

– An O-projection + residual instruction.

– A fused RMS norm & up-gate & SiLU instruction.

– A down-projection + residual instruction.

– An RMS norm & language modeling head instruction, for computing the final token logits.

We implement each of these instructions using a common CUDA template (with load, store, compute boilerplate functions), facilitating interoperability within our interpreter framework.
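
To give a feel for the pattern, here is a heavily simplified sketch of what an instruction template and interpreter loop can look like. Every name, the fixed instruction width, and the opcode layout here are illustrative assumptions; the real implementation in our open-source repository is considerably more involved (for example, it overlaps the next instruction's loads with the current instruction's compute and stores, which is the subject of the next section).

```cuda
#include <cstdint>

constexpr int INSTR_WORDS = 8;   // made-up fixed width: one opcode word plus arguments

enum Opcode : int32_t { OP_NOP = 0, OP_RMS_QKV_ROPE, OP_ATTENTION, OP_DOWN_PROJ };

struct Globals { /* pointers to weights, activations, sync counters, ... */ };

// Every instruction type exposes the same load / compute / store phases, so the
// interpreter can treat them uniformly (and pipeline them against each other).
struct RmsQkvRope {
    __device__ static void load(const Globals&, const int32_t*)    { /* global -> shared */ }
    __device__ static void compute(const Globals&, const int32_t*) { /* math on shared data */ }
    __device__ static void store(const Globals&, const int32_t*)   { /* shared -> global */ }
};
struct Attention {
    __device__ static void load(const Globals&, const int32_t*)    {}
    __device__ static void compute(const Globals&, const int32_t*) {}
    __device__ static void store(const Globals&, const int32_t*)   {}
};

template <typename Op>
__device__ void run(const Globals& g, const int32_t* args) {
    Op::load(g, args);
    Op::compute(g, args);
    Op::store(g, args);
}

// One thread block per SM; each block walks its own host-scheduled instruction stream.
__global__ void interpreter(Globals g, const int32_t* streams, int instrs_per_sm) {
    const int32_t* instr = streams + blockIdx.x * instrs_per_sm * INSTR_WORDS;
    for (int i = 0; i < instrs_per_sm; ++i, instr += INSTR_WORDS) {
        switch (static_cast<Opcode>(instr[0])) {
            case OP_RMS_QKV_ROPE: run<RmsQkvRope>(g, instr + 1); break;
            case OP_ATTENTION:    run<Attention>(g, instr + 1);  break;
            default:              break;   // remaining instruction types elided
        }
    }
}
```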

Issue 2/3: Sharing Shared Memory to Eliminate Memory Bubbles

The instruction-and-interpreter structure lets us cleanly organize our
megakernel. However, we haven't yet addressed the key issue: making sure that
model weights are always being loaded in order to maximize memory bandwidth
utilization.


The reason why a megakernel lets us solve this problem is that we can pipeline
memory loads across instructions: our interpreter will start loading the model
weights for an instruction as soon as it can, even if a previous instruction is still
finishing up (e.g. storing out its results to global memory). It's this tight
transitioning between instructions that minimizes the memory bubbles that
would otherwise appear if we launched multiple kernels.

However, there's a catch: loading the weights from global memory for the next
instruction doesn't do you much good if you have no place to put the data you
loaded! More precisely, all of our weight matrices are loaded from GPU global
memory into our SM's "shared memory" – NVIDIA's term for the fast memory on
each SM. Shared memory is a scarce resource on each SM, and we can't start a
load for a new instruction if a previous instruction is using all of it. This
necessitates a way to keep track of which instruction is using which piece of
shared memory and quickly transition shared memory to the next instruction
when the current instruction is done with it.

We accomplish this by paging shared memory. We divide the first 213 kB of shared memory on an H100 into 13 16-KiB pages, and use the remaining shared memory for special purposes, like storing instruction parameters. To use one of these pages, instructions have to explicitly request and release them from the interpreter. The interpreter automatically passes released pages to the next instruction, allowing it to start issuing memory loads as soon as shared memory becomes available.
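
Here is a sketch of the idea. This is an illustration rather than our actual allocator (in particular, the real interpreter manages the handoff of released pages to the next instruction itself, rather than having instructions scan for a free slot), but the request/release protocol is the important part:

```cuda
constexpr int NUM_PAGES  = 13;
constexpr int PAGE_BYTES = 16 * 1024;   // 13 x 16 KiB pages carved out of shared memory

// Illustrative page pool; the bookkeeping itself lives in the leftover shared
// memory alongside things like instruction parameters.
struct PagePool {
    int in_use[NUM_PAGES];   // 0 = free, 1 = currently owned by an instruction

    // An instruction requests a page before it issues global->shared loads into it.
    __device__ char* request(char* smem_pages) {
        for (int p = 0; p < NUM_PAGES; ++p) {
            if (atomicCAS(&in_use[p], 0, 1) == 0) {   // grab the first free page
                return smem_pages + p * PAGE_BYTES;
            }
        }
        return nullptr;   // nothing free yet: a previous instruction still owns everything
    }

    // An instruction releases a page the moment it is done with it, so the next
    // instruction's weight loads can begin as soon as possible.
    __device__ void release(char* smem_pages, char* page) {
        int p = static_cast<int>(page - smem_pages) / PAGE_BYTES;
        atomicExch(&in_use[p], 0);
    }
};
```

The point is that shared-memory ownership now transfers at instruction granularity rather than kernel granularity, so the next instruction's loads never wait on an artificial kernel boundary.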

Issue 3/3: Synchronization


While megakernels let us minimize pipeline bubbles, they also introduce a new
problem: synchronization. The performance limitation with the normal many-
kernel execution model is that no thread blocks in a kernel can start until all
thread blocks in previous kernels are finished. However, it's precisely this property
that makes it easy to manage data dependencies. When a kernel launches, CUDA
guarantees that all of the kernel's input tensors have already been produced and
are safe to read from immediately.

With megakernels, we have no such guarantees: when an SM starts to execute a new instruction, its inputs might not be ready! To address this, we explicitly
synchronize the instructions inside of our megakernel. We accomplish this with a
simple counter system. Before the megakernel launches, we initialize an array of
counters (i.e. integers) in GPU global memory with a starting value of zero.
Whenever an instruction completes, it increments one of these counters.
Similarly, whenever a new instruction starts, it must wait for some of these
counters to reach a target value, indicating that all of its dependencies have
finished.
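
Mechanically, this can be as simple as an atomic increment on the producer side and a spin-wait on the consumer side. The sketch below is illustrative (the real megakernel tracks many counters and hides these waits inside the interpreter, and its memory ordering is more careful than shown here), but it captures the scheme:

```cuda
// Counters are plain ints in GPU global memory, zero-initialized before the
// megakernel launches.

// Producer side: an instruction calls this after storing its outputs.
__device__ void signal_done(int* counter) {
    if (threadIdx.x == 0) {
        __threadfence();         // publish our global-memory writes first
        atomicAdd(counter, 1);   // then bump the counter consumers are watching
    }
}

// Consumer side: an instruction calls this before reading its inputs.
__device__ void wait_for(int* counter, int target) {
    if (threadIdx.x == 0) {
        while (atomicAdd(counter, 0) < target) { /* spin until dependencies finish */ }
    }
    __syncthreads();   // the whole thread block waits on thread 0's check
}
```

Waiting for a counter to reach a target value, rather than for a single flag, is what lets one instruction depend on several producers at once.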

One optimization this enables is in the big multi-layer perceptrons (MLPs) in Llama-1B.

– In a naive implementation using PDL, one must wait for the whole intermediate hidden state to be computed before beginning the down projection matrix multiply.


– We instead produce and consume the intermediate state in four chunks, each with its own counter. This way, an instruction for the down projection only needs to wait for its input chunk to finish.

Putting It All Together


To our knowledge, our H100 megakernel represents the first time anyone has run
the forward pass for a 16-bit 1B+ parameter language model in under one
millisecond on a GPU. Our B200 implementation pushes this even further to
under 680 microseconds per forward pass!

As shown in Figure 1, our megakernel outperforms vLLM and SGLang baselines (which use CUDA graphs and torch compilation):

– On an H100, our megakernel runs almost 2.5x faster than vLLM and over 1.5x
faster than SGLang.

– On a B200, the gap with vLLM rises to over 3.5x, and we remain more than 1.5x
faster than SGLang, too.

We're still actually quite a ways off from the theoretical limit on a B200, which is around 3,000 forward passes per second. Part of this gap is because this
theoretical limit is based purely on memory bandwidth – but we still have to wait
to load activations. And although these activations are small (and don't cost a lot
of bandwidth), there are still latencies in loading them that we can't hide. A
breakdown of the runtime of our current B200 forward pass (total runtime 600
microseconds):

– 250 microseconds are spent storing activations, awaiting consistency, and loading them. This is about 20% higher than a simple model would suggest: since each instruction has a dependence on the last one, we need to pay two load latencies (check ready, and then load activations) and two store latencies (store activations, then mark ready) per instruction. Using ~500 nanoseconds latency per load / store, this would impose about 200 microseconds of overhead. (We suspect some of the remaining 50 microseconds comes from time spent processing atomics in global memory.)
– 200 microseconds are spent actually running RMS norm and matrix-vector
computations. 95% of this portion is devoted to matrix-vector. On Blackwell,
we find that using the tensor cores is marginally helpful for this; on Hopper, we
find it better to simply run on the CUDA cores. This difference comes from the
fact that both GPUs have relatively similar CUDA core performance, but
Blackwell tensor cores are much faster.

– 30 microseconds are spent awaiting weights from global memory (pipelining works!). Of these, 40% are spent in the LM head, which is the best-pipelined part of the whole megakernel due to its homogeneity and huge size.

– 40 microseconds are spent on low-level synchronization overhead across warps. A key issue here is that CUDA's asynchronous barriers are relatively slow, even when they're already in the "pass" state, requiring about 60 nanoseconds each time.
– 80 microseconds are spent on setup and various other overheads (e.g. passing instruction barriers, marking pages as complete, etc.).

We think there's probably more to do on each of these, but that'll have to wait for
a future update!

The Megakernel Cinematic Universe


In this blog, we focus narrowly on designing a megakernel for low-latency, batch-
size one LLM inference. However, we believe that the ability to more precisely
control GPU execution with megakernels can more generally be applied to
accelerate a much broader set of AI workloads. Stay tuned!


The Main Message of this Blog Post

If you'd like to learn more, please reach out to Ben or Jordan! Please include a
tribute of at least five pictures of kittens in your email.

– Ben: bfs@stanford.edu

– Jordan: jbj@stanford.edu

And many, many thanks to Together AI for generously providing us with B200s
and H100s to do this work, which would not have been possible without them!

