Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B
There are some applications that benefit from running LLMs really, really fast. This
low-latency regime encompasses applications like chatbots and human-in-the-
loop workflows, where users care a lot about seeing responses come back
immediately.
It turns out that popular LLM inference engines – vLLM and SGLang – are only
able to use at most 50% of available GPU bandwidth when running this workload
on an H100. The root of the problem, which we'll describe more below, is that
existing systems break down a model forward pass into around a hundred
separate kernels that each implement a few operations (e.g. RMS norm,
attention, an MLP layer + activation, rotary). Each kernel comes with a setup and
teardown period and during this time no useful work gets done – for instance, the
all-important task of loading model weights is stalled.
Figure 1: Speed! Results generated with a 32-token prompt and 128 generated
tokens, with no speculation
In this post, we show how we can bypass this problem by merging the entire
Llama-1B forward pass into a single "megakernel" that eliminates kernel
boundaries altogether. Doing this achieves brr – on an H100, we use 78% of
memory bandwidth and outperform existing systems by over 1.5x. (To our
knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In
the rest of this post, we'll walk through how and why one would do this.
Specifically:
– First, we'll talk about how small kernels lead to AI systems that underutilize the
GPU's full bandwidth.
– Second, we'll describe three important points about how we built our
megakernel: how we fused lots of kernels together, how we share hardware
resources across them to minimize overhead, and how we synchronize them
efficiently.
If you're interested in learning more of the details or using these ideas yourself,
we're open-sourcing all of our code here.
Figure 2: An example set of kernel boundaries for the Llama-1B transformer block.
Red boxes delineate the work done by individual kernels.
When we dug into it, we found that a key culprit is the current kernel-based approach to running models, which introduces stalls that prevent us from constantly loading from memory:
– First: GPU kernels are launched with a strict ordering, so that a thread block in
one kernel can't start until all thread blocks in previous kernels have completely
finished. Consequently, every time we start a kernel, we have to wait for all the
straggler thread blocks from the prior one to finish. For example, if a kernel runs 512 thread blocks (like our Llama-1B down projection) but we only have 148 streaming multiprocessors (like on a B200), we end up with 80 empty SMs at the end (see the short calculation below).
– Second: each kernel launch also comes with its own setup and teardown period, during which no useful work (like loading model weights) gets done.
Taken together, these stalls form the "memory pipeline bubbles" our title references, and they are a key reason that we're not always loading from memory.
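To make the tail effect from the first bullet concrete, here is a quick back-of-the-envelope calculation (assuming, for simplicity, one resident thread block per SM):

```cuda
// Back-of-the-envelope tail effect for the down-projection example above:
// 512 thread blocks scheduled onto 148 SMs, assuming one resident block per SM.
#include <cstdio>

int main() {
    const int num_blocks = 512;  // e.g. the Llama-1B down projection
    const int num_sms    = 148;  // e.g. a B200

    int full_waves = num_blocks / num_sms;                       // 3 full waves
    int last_wave  = num_blocks % num_sms;                       // 68 blocks left over
    int idle_sms   = (last_wave == 0) ? 0 : num_sms - last_wave; // 80 SMs sit idle

    printf("%d full waves, %d blocks in the last wave, %d idle SMs\n",
           full_waves, last_wave, idle_sms);
    return 0;
}
```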
For short operations, these pauses add up, wasting a huge chunk of potential
bandwidth. In part, this is because Llama-1B (actually 1.24B parameters) in batch
size 1 is just so... small: if each operation is really fast, then the time spent in-
between them really starts to matter.
CUDA does provide some existing features (e.g. graphs, streams, and programmatic dependent launch, or PDL) that partially address these problems, but we wanted to see whether a different approach could solve all of them at once: fusing the entire model forward pass into a single kernel.
How to Megakernel
Next, we'll show you how we fused a whole Llama forward pass into a single kernel, and our methods for resolving three key problems:
– fusing many different operations into one kernel,
– sharing hardware resources across these fused operations so they don't step on each other, and
– synchronizing them efficiently when one operation depends on another's results.
For our end-to-end Llama forward pass megakernel, we define a set of instructions that together cover every operation in the model.
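The actual definitions live in our open-sourced code; as a rough illustration of how an instruction stream for an interpreter-style megakernel can be encoded, here is a minimal sketch. The opcode names and fields below are hypothetical placeholders, loosely following the operations mentioned earlier (RMS norm, QKV + rotary, attention, the MLP, and the final projection), not our real definitions.

```cuda
// Illustrative sketch of an instruction descriptor for an interpreter-style megakernel.
// Opcode names, fields, and layout here are hypothetical, not our real definitions.
#include <cstdint>

enum class Opcode : uint32_t {
    NormQkvRope,       // fused RMS norm + QKV projection + rotary embedding
    Attention,         // attention against the KV cache
    OProjResidual,     // output projection + residual add
    NormUpGateSilu,    // fused RMS norm + up/gate projections + activation
    DownProjResidual,  // down projection + residual add
    NormLmHead,        // final RMS norm + LM head
};

struct Instruction {
    Opcode   opcode;     // which fused operation to run
    uint32_t layer;      // which transformer layer it belongs to
    uint32_t block_idx;  // which slice of the output this SM computes
    // Pointers (or offsets) to weights and activations would also live here.
};

// Each SM runs an interpreter loop that walks its queue of descriptors and
// dispatches to the device function implementing each opcode.
__device__ void run_instruction(const Instruction& inst) {
    switch (inst.opcode) {
        case Opcode::Attention: /* attention for inst.layer, inst.block_idx */ break;
        /* ...one case per opcode... */
        default: break;
    }
}
```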
The reason a megakernel lets us eliminate these bubbles is that we can pipeline
memory loads across instructions: our interpreter will start loading the model
weights for an instruction as soon as it can, even if a previous instruction is still
finishing up (e.g. storing out its results to global memory). It's this tight
transitioning between instructions that minimizes the memory bubbles that
would otherwise appear if we launched multiple kernels.
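As a heavily simplified sketch of this pattern, here is a double-buffered loop that prefetches the next instruction's weights with cp.async-style copies while the current instruction computes. The tile size, the weight-pointer layout, and the empty compute() body are placeholders, not the actual interpreter.

```cuda
// Simplified sketch of pipelining weight loads across instructions within one kernel.
// Requires cp.async hardware (sm_80+). Tile size, pointer layout, and compute() are
// placeholders; the real interpreter is considerably more involved.
#include <cuda_pipeline.h>

constexpr int TILE_FLOATS = 4096;  // one 16 KiB page worth of fp32 weights (illustrative)

__device__ void compute(const float* weights) { /* the instruction's math goes here */ }

__global__ void interpreter(const float* const* weight_ptrs, int num_instructions) {
    __shared__ float pages[2][TILE_FLOATS];  // two shared-memory "pages"

    // Kick off the loads for the first instruction's weights.
    for (int i = threadIdx.x; i < TILE_FLOATS; i += blockDim.x)
        __pipeline_memcpy_async(&pages[0][i], &weight_ptrs[0][i], sizeof(float));
    __pipeline_commit();

    for (int inst = 0; inst < num_instructions; ++inst) {
        int cur = inst & 1, nxt = cur ^ 1;

        // Start loading the *next* instruction's weights before computing the current one.
        if (inst + 1 < num_instructions) {
            for (int i = threadIdx.x; i < TILE_FLOATS; i += blockDim.x)
                __pipeline_memcpy_async(&pages[nxt][i], &weight_ptrs[inst + 1][i],
                                        sizeof(float));
        }
        __pipeline_commit();            // possibly an empty batch on the last instruction

        // Wait for everything except that most recent prefetch, which stays in flight.
        __pipeline_wait_prior(1);
        __syncthreads();                // make every thread's copies visible before reading
        compute(pages[cur]);
        __syncthreads();                // don't overwrite this page until everyone is done
    }
}
```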
However, there's a catch: loading the weights from global memory for the next
instruction doesn't do you much good if you have no place to put the data you
loaded! More precisely, all of our weight matrices are loaded from GPU global
memory into our SM's "shared memory" – NVIDIA's term for the fast memory on
each SM. Shared memory is a scarce resource on each SM, and we can't start a
load for a new instruction if a previous instruction is using all of it. This
necessitates a way to keep track of which instruction is using which piece of
shared memory and quickly transition shared memory to the next instruction
when the current instruction is done with it.
We accomplish this by paging shared memory. We divide the first 208 KiB of shared memory on an H100 into thirteen 16 KiB pages, and use the remaining shared memory for special purposes, like storing instruction parameters. To use these pages, instructions have to explicitly request them from the interpreter and release them when done. The interpreter automatically passes released pages to the next instruction, allowing it to start issuing memory loads as soon as shared memory becomes available.
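Here is a simplified sketch of the bookkeeping side of this scheme: a pool of 16 KiB pages in shared memory plus a free-bitmap, with acquire/release helpers. The real interpreter hands released pages directly to the next instruction's loads; the names and retry behavior below are illustrative only.

```cuda
// Simplified sketch of shared-memory paging: a pool of 16 KiB pages plus a free-bitmap.
// The names and the retry behavior are illustrative, not our actual implementation.
#include <cstdint>

constexpr int PAGE_BYTES = 16 * 1024;
constexpr int NUM_PAGES  = 13;   // thirteen 16 KiB pages = 208 KiB of H100 shared memory

struct PagePool {
    uint32_t free_mask;                            // bit i set => page i is free
    alignas(16) char pages[NUM_PAGES][PAGE_BYTES]; // the pages themselves
};

// Claim a free page (returns its index), or -1 if every page is still in use.
__device__ int acquire_page(PagePool& pool) {
    while (true) {
        uint32_t mask = atomicOr(&pool.free_mask, 0u);  // atomic read of the bitmap
        if (mask == 0u) return -1;                      // all pages busy; caller retries later
        int page = __ffs(mask) - 1;                     // lowest-numbered free page
        uint32_t bit = 1u << page;
        if (atomicAnd(&pool.free_mask, ~bit) & bit)     // claim it if it was still free
            return page;
    }
}

// Hand a page back so the next instruction can start loading into it.
__device__ void release_page(PagePool& pool, int page) {
    atomicOr(&pool.free_mask, 1u << page);
}

__global__ void megakernel_skeleton(/* instruction stream, weights, ... */) {
    // The pool lives in dynamic shared memory; using more than 48 KiB per block requires
    // raising cudaFuncAttributeMaxDynamicSharedMemorySize on the host side.
    extern __shared__ char smem[];
    PagePool& pool = *reinterpret_cast<PagePool*>(smem);
    if (threadIdx.x == 0) pool.free_mask = (1u << NUM_PAGES) - 1;  // all pages start free
    __syncthreads();
    // ...interpreter loop: instructions call acquire_page()/release_page() as they run...
}
```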
While megakernels let us minimize pipeline bubbles, they also introduce a new
problem: synchronization. The performance limitation with the normal many-
kernel execution model is that no thread blocks in a kernel can start until all
thread blocks in previous kernels are finished. However, it's precisely this property
that makes it easy to manage data dependencies. When a kernel launches, CUDA
guarantees that all of the kernel's input tensors have already been produced and
are safe to read from immediately.
Since our megakernel removes those boundaries, we have to track data dependencies ourselves. Consider the MLP, where the down projection consumes the intermediate state produced by the up/gate projection and activation:
– In a naive implementation using PDL, one must wait for the entire intermediate hidden state to be computed before beginning the down-projection matrix multiply.
– We instead produce and consume the intermediate state in four chunks, each with its own counter. This way, a down-projection instruction only needs to wait for its input chunk to finish.
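A minimal sketch of this counter scheme: one global-memory counter per chunk, a release fence on the producer side, and a spin-wait on the consumer side. The function names and the expected-count parameter are illustrative, not our actual synchronization code.

```cuda
// Minimal sketch of chunked, counter-based synchronization through global memory.
// A producer bumps its chunk's counter once that slice of the intermediate state is
// written; a consumer waits only on the chunk it reads. Names here are illustrative.

// Producer side: call after this block has finished writing its part of `chunk`.
__device__ void signal_chunk_ready(int* chunk_counters, int chunk) {
    __threadfence();                         // make the chunk's data visible device-wide first
    if (threadIdx.x == 0)
        atomicAdd(&chunk_counters[chunk], 1);
}

// Consumer side: wait until `expected` producer blocks have signalled this chunk.
__device__ void wait_for_chunk(int* chunk_counters, int chunk, int expected) {
    if (threadIdx.x == 0) {
        while (*(volatile int*)&chunk_counters[chunk] < expected) { /* spin */ }
        __threadfence();                     // don't let data reads move above the counter read
    }
    __syncthreads();                         // release the rest of the block
}
```

With the intermediate state split into four chunks, the instructions that produce it signal each chunk as they finish writing it, and a down-projection instruction waits only on the chunk it consumes.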
Putting this all together, here's how the megakernel performs:
– On an H100, our megakernel runs almost 2.5x faster than vLLM and over 1.5x faster than SGLang.
– On a B200, the gap with vLLM rises to over 3.5x, and we remain more than 1.5x
faster than SGLang, too.
We're still quite a ways off from the theoretical limit on a B200, which is around 3,000 forward passes per second. Part of this gap is because this
theoretical limit is based purely on memory bandwidth – but we still have to wait
to load activations. And although these activations are small (and don't cost a lot
of bandwidth), there are still latencies in loading them that we can't hide. A
breakdown of the runtime of our current B200 forward pass (total runtime 600
microseconds):
We think there's probably more to do on each of these, but that'll have to wait for
a future update!
If you'd like to learn more, please reach out to Ben or Jordan! Please include a
tribute of at least five pictures of kittens in your email.
– Ben: bfs@stanford.edu
– Jordan: jbj@stanford.edu
And many, many thanks to Together AI for generously providing us with B200s
and H100s to do this work, which would not have been possible without them!