AI Systems Performance Engineering
by Chris Fregly
See http://oreilly.com/catalog/errata.csp?isbn=9798341627789
for release details.
The views expressed in this work are those of the author and do
not represent the publisher’s views. While the publisher and
the author have used good faith efforts to ensure that the
information and instructions contained in this work are
accurate, the publisher and the author disclaim all
responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any
code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
979-8-341-62778-9
[LSI]
Brief Table of Contents (Not Yet Final)
Preface (available)
Chris Fregly
San Francisco, California
Chapter 1. Introduction and AI
System Overview
With Early Release ebooks, you get books in their earliest form
—the author’s raw and unedited content as they write—so you
can take advantage of these technologies long before the official
release of these titles.
The chapter then dives into the core competencies required for
an AI Systems Performance Engineer. We examine the technical
proficiencies essential for the role, including a deep
understanding of hardware architectures, software
optimization techniques, and system-level integration.
Additionally, we discuss the importance of soft skills such as
problem-solving, communication, and collaboration, which are
vital for navigating the interdisciplinary nature of AI projects.
Even minor code tweaks can yield major wins. For example, a
data preprocessing step written in pure Python might be holding
up an entire training pipeline. Reimplementing it in C++ or using
a GPU-accelerated, NumPy-compatible library like NVIDIA’s
cuPyNumeric could remove that bottleneck.
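As a quick illustration, here is a minimal sketch of that swap. It assumes the cupynumeric package is installed; the normalization function and array shapes are purely illustrative.

import numpy as np                 # original CPU-only path
import cupynumeric as cnp          # GPU-accelerated, NumPy-compatible replacement

def normalize(batch):
    # Same NumPy-style code; cuPyNumeric runs these operations on the GPU.
    mean = cnp.mean(batch, axis=0)
    std = cnp.std(batch, axis=0) + 1e-6
    return (batch - mean) / std

batch = cnp.asarray(np.random.rand(1_000_000, 64).astype(np.float32))
normalized = normalize(batch)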
You may want to place data cleverly across nodes using data
parallelism. Or you may need to redesign the workload to use
tensor parallelism or pipeline parallelism because the model is
so large that it doesn’t fit onto a single GPU. Perhaps you are
using a Mixture of Experts (MoE) model and can take
advantage of expert parallelism.
Cross-Team Collaboration
Towards 100-Trillion-Parameter
Models
A 100-trillion-parameter model is an aspirational milestone for
AI. 100 trillion is roughly the number of synaptic connections in
the human brain’s neocortex. Achieving a model of this size is
theoretically possible, but it demands an extraordinary amount
of resources - and money. Scaling to 100-trillion-parameter
models by brute force would be impractical for all but the
absolute wealthiest organizations.
TIP
While the optimizations discussed in this book can be applied to smaller models and
cluster sizes, I will continue to revisit the 100-trillion-parameter model to reinforce the
idea that we can’t just throw hardware at the scaling problem.
This entire NVL72 rack behaves like one giant accelerator to the
user. More details on the compute, memory, and interconnect
hardware of this supercomputer are provided in Chapter 2. For
now, let’s analyze the overall performance
specifications of this AI supercomputer as a whole in the
context of modern generative AI models.
Each NVL72 rack delivers 1.44 exaFLOPS of AI compute in low-
precision (4-bit) mode and provides 13.5 TB of ultra-fast high-
bandwidth memory (HBM) spread across the 72 GPUs. In
simpler terms, it’s a self-contained, 120 kW AI supercomputer
that can train and serve trillion-parameter models - and it fits
into a single rack in your data
center. And by combining these racks to form ultra-
scale clusters, you can support massive multi-trillion parameter
models. Even better, you can provision these racks and rack-
clusters with a few clicks (and quite a few dollars!) using your
favorite cloud provider including AWS, GCP, Azure, CoreWeave,
and Lambda Labs.
TIP
While this book focuses heavily on the Grace-Blackwell generation of NVIDIA chips,
the optimization principles discussed are derived from many previous generations of
NVIDIA hardware. And these optimizations will continue to apply and evolve across
many future NVIDIA chip generations, including Vera-Rubin (2026), Feynman
(2028), and beyond.
TIP
NVIDIA calls the theoretical hardware maximum the “speed of light” as you may
have seen in NVIDIA blogs, documentation, webinars, and conference talks.
TIP
This cost savings is why AI Systems Performance Engineers earn top dollar in the
industry today. We pay for ourselves many times over - especially at scale!
Key Takeaways
The following qualities collectively define the role of the AI
Systems Performance Engineer, whose expertise in merging
deep technical knowledge with strategic, profile-driven
optimizations transforms raw hardware into cost-effective,
high-performance AI solutions.
Measure Goodput.
Conclusion
This introductory analysis underscores that optimizations are
not optional at large scale - they are absolutely necessary. It is
the difference between a system that works and one that is
utterly impractical. Traditional approaches, whether in
hardware or algorithms, break down at this scale. To push
forward, we need both advanced hardware and smart software
techniques.
It’s clear that AI models are pushing physical resource limits.
Hardware is racing to keep up with new model architectures
and algorithms. And performance engineers are the ones in the
driver’s seat to ensure that all this expensive machinery is
actually delivering results.
Now, with the context established, let’s dive into the hardware
components of modern AI systems including the CPUs, GPUs,
memory technologies, network fabrics, and storage
mechanisms. By studying the components that underpin
contemporary AI supercomputers, you will learn the
fundamentals that provide the foundation for the deeper dives
into optimization techniques in later chapters.
Chapter 2. AI System Hardware
Overview
Figure 2-1. NVIDIA Grace-Blackwell Superchip module, containing one Grace CPU
(center) and two Blackwell B200 GPUs (top) on a single module with shared memory
address space.
Now let’s talk about the Blackwell GPU, the brute-force engine
of the superchip.
NVIDIA Blackwell GPU
This is the first time NVIDIA’s flagship data center GPU has used
a chiplet approach. This effectively splits what would be one
enormous GPU into two sizable dies and links them together.
Why do this? Because a single monolithic die is limited by
manufacturing: there is a limit to how large a chip can be
fabricated on silicon. By combining two physical dies into a single GPU,
NVIDIA can double the total transistor budget for the GPU.
Feeding data at 8 terabytes per second means the GPU cores are
kept busy crunching on huge matrices without frequently
stalling to wait for data. NVIDIA also beefed up on-chip caching
as Blackwell has a total of 100 MB of L2 cache (50 MB on each
die). This cache is a small but ultra-fast memory on the GPU that
holds recently used data. By doubling the L2 cache size
compared to H100’s 50 MB L2 cache, Blackwell can keep more
of the neural network weights or intermediate results on-chip,
avoiding extra trips out to HBM. This again helps ensure the
GPU’s compute units are seldom starved for data.
Before moving on, let’s quickly discuss the hierarchy inside the
GPU, as this is useful to understand performance tuning later.
Each Blackwell GPU has 18 NVLink 5 ports where each port can
support 100 GB/s of data transferred bidirectionally or 50 GB/s
in a single direction. Combined, a single GPU can shuffle up to
1.8 TB/s (18 NVLink ports * 100 GB/s) of data with its peers via
NVLink. This is double the per-GPU NVLink bandwidth of the
previous generation: the Hopper H100 uses NVLink 4, which
tops out at 900 GB/s bidirectional per GPU - half the 1.8 TB/s of NVLink 5.
Each switch tray contains two NVSwitch chips (the large chips
visible), and multiple high-speed ports (the blue cables
represent NVLink connections). In the NVL72 rack, 9 such
switch trays, shown in Figure 2-7, provide the fabric that fully
connects the 72 Blackwell GPUs.
Figure 2-7. NVSwitch System of 9 trays inside an NVL72 rack
Multi-GPU Programming
Let’s analyze and compare the GB200 NVL72 and 72-GPU H100
clusters using concrete numbers. Within a single NVL72 rack,
GPU-to-GPU bandwidth is on the order of 100 GB/s or more, and
latency is on the order of 1–2 microseconds for a small message.
Across a conventional InfiniBand network, bandwidth per GPU
might be more like 20–80 GB/s - depending on the number and
speed of the NICs - and latency is likely 5–10 microseconds or
more. The NVL72 fabric offers both higher throughput (2× or
more per GPU) and lower latency (3-5× lower) than the best
InfiniBand networks for node-to-node GPU communication. In
practical terms, an all-reduce collective operation, which
aggregates gradients across GPUs, might consume 20–30% of
iteration time on an InfiniBand-linked H100 cluster, but only
2-3% on the NVLink-connected NVL72 cluster.
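As a rough back-of-the-envelope illustration (the 25% and 2.5% communication fractions below are assumptions drawn from the ranges above, not measured values), shrinking the communication share of each iteration translates directly into faster training steps:

# Fraction of each training iteration spent in all-reduce communication
comm_infiniband = 0.25    # ~20-30% on an InfiniBand-linked H100 cluster
comm_nvlink = 0.025       # ~2-3% on an NVLink-connected NVL72 rack

# If the compute portion stays the same, the relative step times are:
step_infiniband = 1.0
step_nvlink = (1.0 - comm_infiniband) + comm_nvlink    # 0.775

print(f"Estimated step-time speedup: {step_infiniband / step_nvlink:.2f}x")   # ~1.29x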
TIP
NVIDIA also offers an Ethernet-based solution using Spectrum switches with RDMA
over Converged Ethernet (RoCE) as an alternative for interconnect, called Spectrum-
X.
To supply 120 kW to the NVL72 rack, you can’t just use a single
standard power feed. Data centers will typically provision
multiple high-capacity circuits to feed this kind of power. For
instance, one might push two separate power feeds into the
rack for redundancy where each feed is capable of 60 kW.
Under normal operation, the load is balanced between the
feeds. And if one feed fails, the system could shed some load or
throttle the GPUs to stay within the remaining feed’s capacity.
This kind of redundancy is important to protect against a blown
circuit halting your multi-month training job.
The system might stagger the GPU boost clocks by tiny intervals,
so they don’t all spike at exactly the same microsecond,
smoothing out the surge. These are the kind of electrical
engineering details that go into making a 120 kW rack
manageable.
It’s not far-fetched to call this NVL72 rack, at the cutting edge of
high-density compute, a mini power substation. Eight of these racks
combined, for 576 GPUs, would draw nearly 1 MW of power (8
racks * 120 kW per rack), which is the entire capacity of a small
data center! The silver lining is that although 120 kW is a lot in
one rack, you are also getting a lot of work done per watt. In
fact, if one NVL72 replaces several racks of older equipment,
the overall efficiency is better. But you definitely need the
infrastructure to support that concentrated power draw. And
any facility hosting the NVL72 racks must ensure they have
adequate power capacity and cooling as we will discuss next.
The NVL72 keeps GPU temps in the 50-70°C range under load
which is excellent for such power-hungry devices. The cold
plates and coolant loops have been engineered very carefully to
allow each GPU to dump 1000 W and each CPU to dump 500 W
into the system. In addition, the coolant flow rate has to be
sufficient to remove that heat quickly. A rough estimate shows
on the order of 10+ liters per minute of water flowing through
the system to dissipate 120 kW of power with a reasonable
temperature increase.
One side effect of the internal liquid cooling is the weight of the
rack. The NVL72 rack weighs on the order of 3000 lbs (1.3–1.4
metric tons) when filled with hardware and coolant. This is
extremely heavy for a rack as it’s roughly the weight of a small
car, but concentrated on a few square feet of floor. Data centers
with raised floors have to check that the floor can support this
load measured in pounds per square foot. Often, high-density
racks are placed on reinforced slabs or supported by additional
struts. Moving such a rack requires special equipment such as
forklifts. This is all part of the deployment consideration as
you’re installing an AI supercomputer which comes with its
unique physical and logistical challenges.
NVIDIA also integrates management and safety features in the
form of a rack management controller that oversees things like
coolant pumps, valve positions, power usage, and monitors
every node’s status. Administrators can interface with it to do
things like update firmware across all nodes or shut down
the system safely.
In terms of power and cooling, the GB300 NVL72 Ultra also
draws roughly 120 kW and uses liquid cooling, but the benefit is
more compute and memory in the same footprint as the GB200 NVL72. NVIDIA
claims a 1.5× improvement at the rack level for generative AI
workloads relative to GB200, thanks to the beefier GPUs and the
generous usage of FP4 precision. NVIDIA is targeting use cases
like real-time AI agents and multi-modal models that demand
maximum throughput. Essentially, the GB300 is an evolutionary
upgrade as it uses the same architecture, but has more of
everything including more SMs, more memory, and faster
clocks.
Key Takeaways
The following innovations collectively enable NVIDIA’s
hardware to handle ultra-large AI models with unprecedented
speed, efficiency, and scalability.
Ultra-Fast Interconnects.
Future-Proof Roadmap.
The NVL72 and its successors are the core of the AI factory. It’s
the heavy machinery that will churn through mountains of data
to produce incredible AI capabilities. As performance
engineers, we stand on the shoulders of this hardware
innovation. It gives us tremendous raw capability; our role
is to harness that innovation by developing software and
algorithms that make the most of the hardware’s potential.
In the next chapter, we will transition from hardware to
software. We’ll explore how to optimize the operating systems,
drivers, and libraries on systems like NVL72 to ensure that none
of this glorious hardware goes underutilized. In later chapters,
we’ll look at memory management and distributed
training/inference algorithms that complement the software
architecture.
The theme for this book is co-design. Just as the hardware was
co-designed for AI, our software and methods must be co-
designed to leverage the hardware. With a clear understanding
of the hardware fundamentals now, we’re equipped to dive into
software strategies to improve AI system performance. The era
of AI supercomputing is here, and it’s going to be a thrilling ride
leveraging it to its fullest.
GPU Driver
The GPU driver turns on the GPUs’ features and keeps the
hardware fed with work. It’s important to keep the driver up-to-
date as new driver versions often provide performance
improvements and additional support for the latest CUDA
features. Tools like nvidia-smi come with the driver and
allow you to monitor temperatures, measure utilization, query
error-correcting code (ECC) memory status, and enable
different GPU modes like persistence mode.
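The same telemetry that nvidia-smi prints can also be queried programmatically through the driver’s NVML interface. The following is a minimal sketch that assumes the nvidia-ml-py (pynvml) bindings are installed:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # first GPU in the system

util = pynvml.nvmlDeviceGetUtilizationRates(handle)      # GPU and memory utilization (%)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)             # total/free/used bytes

print(f"util={util.gpu}%  temp={temp}C  mem_used={mem.used / 2**30:.1f} GiB")
pynvml.nvmlShutdown()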
CUDA Toolkit and Runtime
On top of the driver sits the CUDA runtime and libraries called
the CUDA Toolkit. The toolkit includes the CUDA compiler,
nvcc , used to compile CUDA C++ kernels as we will see in the
next chapter. When compiled, CUDA programs link against the
CUDA runtime ( cudart ). The CUDA runtime communicates
directly with the NVIDIA driver to launch work and allocate
memory on the GPU.
An important feature of GPU programming is that the generated PTX (a portable,
assembly-like intermediate code for GPUs) is forward-compatible with newer
hardware: the driver can JIT-compile PTX built for one GPU generation to run on
later generations. This is a big selling point of the NVIDIA programming model, and
it’s something that Jensen Huang, NVIDIA’s CEO, reiterates with every new hardware release.
While most of the CUDA Toolkit libraries are C++ based, more
and more Python-based libraries are emerging from NVIDIA
that are prefixed with “cu” and built upon the C++ toolkit. For
instance, cuTile and cuPyNumeric are Python libraries
launched in early 2025. They are targeted at lowering the
barrier to entry for Python developers building applications for
NVIDIA GPUs with CUDA.
Modern server CPUs have dozens of cores and are often split
into multiple Non-Uniform Memory Access (NUMA) nodes. A NUMA
node is a logical grouping of CPUs, GPUs, NICs, and memory
that are physically close to each other. Being aware of the
system’s NUMA architecture is important for performance
tuning. Accessing resources within a single NUMA node is faster
than accessing resources in other NUMA nodes.
TIP
It’s worth noting that, by default, the Linux scheduler will not use a NUMA-aware
scheduling algorithm.
Figure 3-1. 8 GPUs in a node, with 4 GPUs connected to NUMA node 0 and the other 4
to NUMA node 1
import os
import psutil
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, Dataset

# Pin this process to the CPU cores of the NUMA node closest to its GPU.
# (Example core IDs; look up the real mapping with `nvidia-smi topo -m`.)
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
numa_cores = range(0, 32) if local_rank < 4 else range(32, 64)
psutil.Process().cpu_affinity(list(numa_cores))

# Training loop
...
It’s not a coincidence that NVIDIA tackled the CPU-to-GPU bottleneck in their
hardware by combining the CPU and GPU onto a single superchip. Expect NVIDIA to
keep addressing more and more bottlenecks through hardware innovations.
TIP
The OS has a limit on how much memory a user can lock (pin). This is set with the
ulimit -l <max locked memory> command. If you plan to use large pinned
buffers, ensure this limit is high - or set to unlimited for your user - otherwise the
allocation might fail. Typically, one sets it to unlimited for large AI workloads and
HPC applications.
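You can verify the limit from inside your training process as well. The following is a minimal sketch using Python’s standard resource module (Linux only):

import resource

# RLIMIT_MEMLOCK is the maximum number of bytes this process may lock (pin) in RAM.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit / 2**20:.0f} MiB"

print(f"max locked memory: soft={fmt(soft)}, hard={fmt(hard)}")
# If these values are small (often just 64 KiB by default), raise them with
# `ulimit -l` or /etc/security/limits.conf before allocating large pinned buffers.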
Transparent Huge Pages
It’s generally recommended that you enable huge pages for big-
memory workloads. Linux provides Transparent Huge Pages (THP),
which tries to automatically use 2 MB pages whenever possible.
THP is usually enabled by default in modern distributions, in
either madvise or always mode. You can check the setting by
reading
/sys/kernel/mm/transparent_hugepage/enabled .
For most deep learning training jobs, it’s beneficial to enable
THP so that it’s transparent to your program. In this case, you
don’t have to change code, but you’ll gain a boost in CPU
efficiency. Note that the gains from huge pages aren’t massive
in every case. You might see a few percent improvement in
throughput due to fewer page faults. For extremely large
memory usage scenarios, one can reserve explicit huge pages
using hugetlbfs , the Linux pseudo-filesystem, for allocating 1
GB pages. However, this requires more manual setup and
configuration. Enabling THP is an easier, simpler win. Once it’s
on, the OS will back large allocations with 2 MB pages
automatically, reducing kernel overhead.
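A small helper can confirm the THP mode on a training host. This is a sketch that reads the standard Linux sysfs path mentioned above; the active mode is the one shown in brackets:

def thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    # The file looks like "always [madvise] never"; the bracketed entry is active.
    with open(path) as f:
        text = f.read()
    return text.split("[")[1].split("]")[0]

print("Transparent Huge Pages mode:", thp_mode())    # expect "always" or "madvise"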
Now, beyond CPU and memory pinning, there are a few other
OS-level tweaks worth mentioning. These include thread
scheduling, virtual memory management, filesystem caching,
and CPU frequency settings.
Likewise, disabling deep C-states can keep cores from going into
a low-power sleep state. CPU C-states are power-saving modes
defined by the system’s ACPI specification. When a CPU core is
idle, it can enter a C-state to save energy. The deeper the C-state,
the more power is saved, but the longer it may take for the core
to “wake up” when work arrives. Disabling deeper C-states can
remove excessive latency spikes. C0 is the active state; each
higher-numbered C-state represents a deeper level of sleep.
The result should be that your CPUs deliver data to the GPUs as
fast as the GPUs can consume it, without the OS scheduling
things on the wrong core or taking CPU cycles away at the
wrong time. On a well-tuned GPU server, you might notice that
CPU usage isn’t extremely high since the GPUs are doing the
heavy lifting, but CPU usage should be consistent and aligned
with the GPUs. The CPUs stay active enough to prepare the next
batch while the current one is processing. Each GPU’s
utilization graph stays near the top, only dipping when
absolutely necessary for synchronization points - and not
because the GPU is waiting on data or stuck on a slow CPU.
Multi-Process Service
If you enable MPS for these inference jobs, the GPUs can
interleave their work so that while one job is waiting on
memory, another job’s kernel might fill the GPU, etc. The result
is higher overall GPU utilization. In practice, if two processes
each use 40% of a GPU, with MPS you might see the GPU at 80-
90% utilization serving both. For instance, two training
processes that would each take 1 hour on their own - on the
same GPU, run sequentially - can run together under MPS and
finish in a bit over 1 hour total, instead of 2 hours
sequentially. This near-2× speedup comes from interleaving the work of
both training processes and keeping the GPU fully busy.
To visualize, imagine Process A and Process B each launching
kernels periodically without MPS. The GPU schedule might look
like A-B-A-B with gaps in between while each one waits as
shown in Figure 3-2.
Figure 3-2. GPU alternates between running Process A’s kernels and Process B’s
kernels and creates idle gaps in which one process is waiting while the other is active
Note that MPS does not partition GPU memory, so all processes
will share the full GPU memory space. MPS is mainly about
compute sharing and scheduling. The issue is that one process
could request a massive amount of GPU RAM, cause an out-of-
memory (OOM) error on the GPU, and result in terminating all
of the other processes running on the GPU. This is very
disruptive. Also, if one program saturates the GPU 100% on its
own, MPS won’t magically make it go faster as you can’t exceed
100% utilization. It’s only beneficial when individual jobs leave
some slack that others can fill.
Multi-Instance GPU
TIP
The Kubernetes device plugin will list MIG devices as resources like “nvidia.com/mig-
2g.10gb” - in this case, a MIG instance with 2 compute slices and 10 GB of memory.
Jobs can request MIG devices specifically, but you have to be
careful to schedule in a way that uses all slices. For instance, if
you have a 7-slice setup and a job only takes one slice, the other
6 should be packed with other jobs or you’re leaving a lot idle.
It’s possible to configure certain nodes in your cluster to use
MIG for small inference jobs, for example - and configure other
nodes for non-MIG workloads for large training jobs.
You can use nvidia-smi -lgc to lock the graphics (core) clock
and -lmc to lock the memory clock (or -ac to set the
application clocks for memory and graphics together). This can
help ensure that the GPU runs at a consistent frequency, which
is especially useful for benchmarking or when you need
reproducible performance results. By locking the clocks, you
prevent fluctuations that might occur due to the default
auto-boost behavior, which can vary with thermal conditions or
power availability.
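For example, a benchmark harness might lock the clocks before timing anything. The sketch below uses the nvidia-smi flags discussed above via subprocess; the clock values are placeholders (query the supported values with nvidia-smi -q -d SUPPORTED_CLOCKS first), and setting clocks requires administrator privileges.

import subprocess

GPU_CLOCK_MHZ = "1980"     # placeholder graphics clock for your GPU
MEM_CLOCK_MHZ = "6251"     # placeholder memory clock for your GPU

# Lock the graphics clock to a fixed range for reproducible benchmarks.
subprocess.run(["nvidia-smi", "-i", "0", "-lgc", f"{GPU_CLOCK_MHZ},{GPU_CLOCK_MHZ}"], check=True)

# Alternatively, set the application clocks (memory,graphics) together.
subprocess.run(["nvidia-smi", "-i", "0", "-ac", f"{MEM_CLOCK_MHZ},{GPU_CLOCK_MHZ}"], check=True)

# ... run the benchmark ...

# Reset to the default auto-boost behavior afterwards.
subprocess.run(["nvidia-smi", "-i", "0", "-rgc"], check=True)
subprocess.run(["nvidia-smi", "-i", "0", "-rac"], check=True)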
If you run into a GPU OOM error, which you surely will at
some point, it may be caused by memory fragmentation or
excessive memory caching. You can try to clear the cache using
PyTorch’s torch.cuda.empty_cache() , but an OOM almost
always means your workload legitimately needs that much memory.
Also, ensure that your CPU memory isn’t being swapped, as this
indirectly hurts your GPU utilization and goodput: each time the
GPU tries to fetch something from the CPU host and the host
memory page has been swapped to disk, performance is
bottlenecked by the much slower disk I/O. So it’s important to
combine these memory-reduction best practices with the earlier
advice about pinning memory, increasing the ulimit , and
disabling swappiness.
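When debugging a GPU OOM, it helps to see how much memory PyTorch has actually allocated versus merely cached before deciding whether fragmentation or caching is to blame. Here is a minimal sketch using standard torch.cuda APIs:

import torch

def report_gpu_memory(device=0):
    allocated = torch.cuda.memory_allocated(device) / 2**30    # live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30      # cached by the allocator
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"allocated={allocated:.1f} GiB  reserved={reserved:.1f} GiB  total={total:.1f} GiB")

report_gpu_memory()
torch.cuda.empty_cache()     # release cached-but-unused blocks back to the driver
report_gpu_memory()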
The general rule is that the host’s NVIDIA driver version must
be at least as recent as the minimum driver version required by
the CUDA version inside the container. If you’re running a
container with CUDA 12.8, your host must have an NVIDIA
driver version of 570.124.06 or higher. If your host has an older
driver version, a container with CUDA 12.8 may not work
properly.
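A quick sanity check before launching a containerized job can save a failed start. This sketch compares the host driver version (queried with nvidia-smi) against the CUDA 12.8 minimum cited above; adjust the minimum for the CUDA version in your container.

import subprocess

MIN_DRIVER_FOR_CUDA_12_8 = (570, 124, 6)    # minimum driver version cited above

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
driver = tuple(int(part) for part in out.stdout.strip().splitlines()[0].split("."))

if driver < MIN_DRIVER_FOR_CUDA_12_8:
    print(f"Host driver {driver} is too old for a CUDA 12.8 container.")
else:
    print(f"Host driver {driver} satisfies the CUDA 12.8 minimum.")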
TIP
The simplest approach is to use NVIDIA’s official base Docker images from the
NVIDIA GPU Cloud (NGC) or DockerHub image repositories. The images in these
repositories are well tested and document which NVIDIA driver version they need on
the host.
TIP
Persistence mode is recommended when using MIG so that the MIG configuration
remains active on the GPU even if no jobs are running. This way, the GPU doesn’t
have to keep re-building the slices before running each periodic job.
Optimizing Network Communication for
Kubernetes
TIP
Homogeneous workloads, such as all training or all inference, are much easier to
debug and tune from a systems perspective than a heterogeneous mix of both
training and inference.
TIP
With proper monitoring and alerting, you can ensure the job
doesn’t try to over-allocate beyond what you expect. If you do
set a memory limit, make sure it’s above what you actually
expect to use. This provides a bit of headroom to avoid getting
killed by the OOM killer 3 days into a long-running training job.
It’s a good idea to always ensure that the host machine is tuned
since containers can’t change kernel parameters like hugepage
settings or CPU governor limits. Usually, cluster admins set
these parameters and settings through the base OS image. Or, in
a Kubernetes environment, they might use something like the
NVIDIA GPU Operator to set persistence mode and other
sysctl knobs on each node.
Key Takeaways
Below is a list of key takeaways from this chapter including
optimizations across the operating system, driver, GPU, CPU,
and container layers.
Conclusion
This chapter has demonstrated that even the most advanced
GPUs can be hindered by inefficiencies in their surrounding
environment. By aligning data with compute through NUMA-aware
pinning and local storage solutions, overlapping
communication with computation, and fine-tuning both the
host system and GPU drivers, you can dramatically reduce
latency and boost throughput. A well-tuned operating system,
container runtime, cluster orchestrator, and software stack
form the unsung backbone of high-performance AI systems.
If you only have Ethernet, try to ensure that it’s the highest
bandwidth possible - at least 100 Gbit/s. Also make sure your
Ethernet networking stack is tuned to use a large MTU with
jumbo frames (e.g., 9000 bytes) so that you send fewer, bigger
packets rather than many small ones. This is the same intuition
as with large files versus small files: fewer, larger packets
create less overhead than many smaller packets. Lastly, ensure
your TCP buffer sizes are large enough so the system can keep
plenty of data in flight.
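It’s also worth verifying that jumbo frames are actually in effect on the interface your training traffic uses. The following is a minimal sketch that reads the standard Linux sysfs path; eth0 is a placeholder interface name.

def mtu(interface="eth0"):     # placeholder; substitute your actual NIC name
    with open(f"/sys/class/net/{interface}/mtu") as f:
        return int(f.read().strip())

iface_mtu = mtu()
print(f"MTU: {iface_mtu}")
if iface_mtu < 9000:
    print("Jumbo frames are not enabled; consider raising the MTU to 9000 on this link.")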
TIP
To create a controlled cluster network in a cloud environment such as AWS, your on-
premise data center would need to use a dedicated, managed link to AWS using AWS
Direct Connect. However, if your connection between your on-premise data center
and AWS runs over the public internet in any way, it would not typically be described
as a controlled cluster network due to variable congestion and unpredictable
network conditions.
Although RDMA can move data without using the CPU to copy the
data, the CPU still plays a critical role. The CPU sets up RDMA
connections, initiates the transfer, handles completion
interrupts, and manages the control path. Therefore, you
should make sure the network interrupts and threads are
pinned to a CPU in the same NUMA node as your InfiniBand
host channel adapter (HCA). For instance, if an InfiniBand HCA
is in NUMA node 0, you want its interrupts handled by threads
running in CPU cores connected to the same NUMA node 0 to
reduce latency and improve overall efficiency.
To debug NCCL, one can use the NCCL Profiler Plugin API to
monitor the internal timeline of GPU communications and
pinpoint any lagging device or bottleneck in the system. The
NCCL Profiler Plugin API is designed to address performance
issues that become increasingly difficult to diagnose as GPU
clusters scale up.
NCCL registers GPU memory for direct data transfers over high-
speed interconnects like NVLink or even directly to a network
card, bypassing the extra overhead of moving data through host
memory. This direct path is essential for minimizing latency
and maximizing throughput during operations like all-reduce,
which aggregates gradients across GPUs.
NCCL_DEBUG
NCCL_ALGO
NCCL_NTHREADS
This controls how many CPU threads each GPU uses for
NCCL’s networking operations. If you find that your CPU
resources are underutilized, increasing NCCL_NTHREADS
can allow more concurrent processing of network tasks,
thereby boosting the overall network-link utilization. In
distributed training, where large amounts of gradient
data need to be communicated, having more threads
dedicated to managing these transfers can help reduce
bottlenecks and improve synchronization speed.
NCCL_BUFFSIZE
NCCL_IB_HCA
NCCL_IB_TIMEOUT
NCCL_SOCKET_IFNAME
For setups where NCCL falls back to TCP (or when using
Ethernet directly), you can specify which network
interface should be used. This is critical on multi-homed
hosts where you want the communication to occur over a
specific high-speed interface rather than an unrelated
one. (A sketch of setting these variables from Python
appears after this list.)
NCCL_LL_THRESHOLD
NCCL_MAX_NRINGS
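Here is a minimal sketch of how such variables are typically set. The specific values are illustrative placeholders, not recommendations, and the script assumes the usual torchrun launcher environment (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).

import os
import torch.distributed as dist

# Illustrative NCCL settings; tune these for your own fabric and topology.
os.environ["NCCL_DEBUG"] = "INFO"            # log NCCL topology and algorithm choices
os.environ["NCCL_ALGO"] = "Ring"             # force a specific collective algorithm
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # placeholder: the high-speed interface to use
os.environ["NCCL_IB_HCA"] = "mlx5_0"         # placeholder: the InfiniBand HCA to use

# The variables must be set before the NCCL communicator is created.
dist.init_process_group(backend="nccl")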
Figure 4-3.
The NIXL Core manages the metadata and memory buffers. The
NIXL Backend API interfaces with various transport backends
like UCX, GPUDirect Storage, S3, or a custom backend. NIXL can
efficiently move data between different tiers such as GPU HBM,
CPU memory (DRAM), file storage (NVMe SSD), and object
storage. The NIXL API abstracts the complexity of transferring
data across heterogeneous memory and storage devices in a
distributed inference setting.
NIXL picks the most efficient route for each data transfer and it
uses zero-copy transfers when possible to avoid needless copy
steps. For instance, NIXL avoids copying data to a bounce buffer
in the host’s CPU memory.
TIP
GPUDirect Storage fulfills a similar role to GPUDirect RDMA, with the key difference
being that it is tailored for accelerated disk I/O rather than direct GPU-to-GPU
communication.
Large model training jobs usually need to read huge datasets.
For example, it’s common to have billions or even trillions of
tokens of text, billions of images, or hundreds of thousands of
hours of audio. If you try to stream this from a single spinning
disk, you’ll waste a lot of money because your GPUs will starve
for data that the disk cannot deliver fast enough. That’s why
most serious AI systems use either a parallel storage system or
large, fast, local solid-state drives (SSDs).
Also, make sure the NVMe driver is using all available queues -
it should by default. This will maximize I/O throughput. And if
you are in a virtual environment, ensure the virtio drivers
are up to date to ensure optimal I/O performance.
File systems like XFS and EXT4 are common and should both be
tuned. XFS is often recommended for parallel throughput on
multi-core systems. Ensure mount options aren’t introducing
overhead. For example, you may want to disable atime (access
time) and avoid writing the extra metadata if you don’t need
this information.
With GDS, the GPU’s DMA engine can initiate reads directly
from the storage device into its GPU memory. Using GDS,
reading data into GPU memory skips the extra hops through the
CPU memory buffers. The obvious benefit of GDS is that it
reduces CPU usage since the CPU isn’t managing those data
transfers - or touching the data in any way. It can also reduce
latency slightly.
In practice, however, not all storage stacks are GDS-ready, as
GDS requires special hardware, software, and drivers. That said,
a lot of modern, high-end NVMe drives and RAID controllers
now support GDS. Programmatically, one can use NVIDIA’s
cuFile library with GDS by reading a file with cuFileRead
which retrieves the data straight into the GPU’s memory buffer.
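As a hedged sketch of what a GDS-style read looks like from Python, the example below assumes the RAPIDS kvikio package (NVIDIA’s Python wrapper around cuFile) and CuPy are installed; the file name and sizes are placeholders.

import cupy as cp
import kvikio

# Allocate the destination buffer directly in GPU memory.
n_floats = 1_000_000
gpu_buf = cp.empty(n_floats, dtype=cp.float32)

# cuFile (via kvikio) moves the bytes from NVMe straight into GPU memory,
# skipping the bounce buffer in host RAM when GDS is available.
f = kvikio.CuFile("training_shard.bin", "r")     # placeholder file name
bytes_read = f.read(gpu_buf)
f.close()

print(f"read {bytes_read} bytes directly into GPU memory")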
Suppose you have 8 GPUs per node, each performing, say, 1 GB/s
of gradient all-reduce communication, over a 100 Gbit/s (12.5
GB/s) network. Each GPU is using roughly 1/12th of the node’s
bandwidth, and the aggregate 8 GB/s stays under the 12.5 GB/s
limit, so you are not exceeding the network’s bandwidth -
assuming nothing else is running over the network, of course.
import torch
from torch.utils.data import DataLoader
The general rule is to profile the data pipeline and find out
which part of your pipeline is the limiting factor. If your GPUs
are relatively idle at only 50% utilization and you notice that
each batch takes a long time to load, you likely need to tune
your data pipeline.
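A common first step is to give the input pipeline more parallelism and overlap. The following is a minimal, self-contained sketch; the worker and prefetch counts are illustrative starting points, not tuned values.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,             # background worker processes preparing batches
    pin_memory=True,           # page-locked host memory for faster host-to-device copies
    prefetch_factor=4,         # batches each worker keeps ready ahead of time
    persistent_workers=True,   # avoid re-forking workers every epoch
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for features, labels in loader:
    # non_blocking=True overlaps the copy with GPU compute when memory is pinned
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass ...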
Key Takeaways
The key lessons from this chapter remind us that the
performance of AI systems is determined by the full stack,
spanning both software and hardware ecosystems.
Conclusion
The evolution of distributed, multi-GPU communication
libraries and strategies represents a pivotal shift in high-
performance deep learning. By adopting specialized libraries
such as NCCL for collective operations, NIXL for efficient
inference data transfers, and RDMA for ultra-low latency
communication, systems can dramatically reduce data
movement bottlenecks. The integration of container runtimes,
Kubernetes orchestration, and intelligent scheduling further
ensures that these optimized pipelines translate directly into
improved training and inference performance. Additionally, by
addressing the challenges in storage and I/O through advanced
techniques like GPUDirect Storage and intelligent data caching,
modern AI deployments can sustain high throughput even as
model complexity scales.
Ultimately, this chapter underscores that no single component
can provide peak performance alone. It is the careful
coordination of high-speed communication, efficient data
handling, and system-wide tuning that leads to scalable, robust
AI systems capable of tackling some of the most demanding
challenges in today’s computational landscape. This integrative
perspective not only streamlines AI workflows but also lays a
strong foundation for future innovations in distributed deep
learning.
A key takeaway was that scaling these training runs isn’t only
about adding more compute. It requires a delicate balance
between long-term planning and rapid problem solving,
ensuring that every fix improves both data efficiency and
model intelligence. The teams discovered through co-design
that choices made for model design (e.g., scaling laws and
pre-training objectives) had a direct impact on infrastructure
needs (e.g., multi-cluster orchestration and network reliability),
and vice versa. Even with careful upfront planning, some
degree of unpredictability remained, and the process
demanded persistent monitoring and agile, cross-team
collaboration.
Figure 5-1. 47% improvement in average step time using PyTorch compile for kernel
fusion at FP8 precision (Source: PyTorch Native FP8 Data Types. Accelerating PyTorch
Training Workloads… | by Chaim Rand | TDS Archive | Medium).
The ROI of adopting vLLM is clear for any AI service with spiky
or concurrent request loads. Higher throughput per dollar on
the same hardware avoids expensive hardware upgrades.
Importantly, NVIDIA recognized the value of such software-
level optimizations with its release of the Dynamo serving
framework. Dynamo natively supports vLLM as a backend. This
allows organizations to combine vLLM’s algorithmic and GPU-
level optimizations with the cluster-level routing and cache
management optimizations of NVIDIA Dynamo.
Figure 5-2. Strassen’s sub-cubic algorithm for multiplying 2x2 matrices. (Source:
https://en.wikipedia.org/wiki/Strassen_algorithm)
But the real proof came when those algorithms were tested on
actual hardware. AlphaTensor discovered a method specific to
the NVIDIA Volta V100 GPU generation which multiplied large
matrices 10–20% faster than the standard GPU library could at
the time. A 10–20% speedup in GEMM performance is huge. It’s
like gaining an extra 10–20% in free compute for every model’s
forward and backward pass. Such gains typically come from a
new hardware generation - or months of low-level CUDA
tuning. Yet, in this case, the AI found a better way
mathematically in a relatively short amount of time.
Figure 5-4.
HPE’s Grace-Blackwell
Supercomputer for the Trillion-
Parameter Era
In early 2025, Hewlett Packard Enterprise shipped its first
NVIDIA Grace-Blackwell system called the HPE Cray XD. This is
their implementation of the GB200 NVL72 rack specification.
This marks one of the first real-world deployments of this
technology outside of NVIDIA’s own labs. The HPE Cray XD rack-
scale system is built for organizations that need to train and
serve models at the 1-trillion parameter scale using a single,
unified memory space.
HPE emphasizes that such systems offer lower cost per token
training and best-in-class throughput for ultra-scale models
(HPE announces shipment of its first NVIDIA Grace Blackwell
system | HPE). This means that even though the absolute cost of
their rack is high, the efficiency - in terms of tokens processed
per second per dollar invested - is significantly higher when
models are at the trillion-scale. The target users are cloud AI
service providers and large enterprise research groups.
The HPE engineers who first used the HPE Cray XD learned
that, with all 72 GPUs in one domain, debugging and optimizing
parallel training jobs became easier than on a traditional
8-GPU-per-node cluster, as there were
fewer moving pieces in terms of network communication
patterns. However, they also learned about failure modes
unique to the NVL72 system such as faults in the
NVLink/NVSwitch fabric. This type of failure could impact
many GPUs at once with the NVL72 rack design. Previously, a
bad InfiniBand link would affect only one node.
However, the costs and risks are very high. Training might cost
tens or hundreds of millions of dollars in GPU compute time.
The engineering must ensure that each of the 576 GPUs is
utilized to its maximum the entire time. Any inefficiency could
waste compute hours and money. Techniques proven at
relatively small scale - such as those used by DeepSeek to train
their V3 model on a cluster of limited-capability NVIDIA H800
GPUs - would be mandatory at this 100-trillion-parameter
scale.
Self-Optimizing Algorithms.
Conclusion
The journey through these case studies narrates a pivotal era in
AI systems performance engineering marked by the seamless
integration of NVIDIA’s groundbreaking GPU and CPU
architectures including the rack-scale Grace-Hopper and Grace-
Blackwell systems of today - as well as the Vera-Rubin and
Feynman systems of tomorrow. By fusing the CPU and GPU
into a superchip module, NVIDIA has redefined the capabilities
of LLMs and achieved unprecedented levels of efficiency and
scalability.
Figure 6-2. Graph execution in CUDA reduces overhead when launching multiple
kernels in a sequence (Source: https://pytorch.org/blog/accelerating-pytorch-with-
cuda-graphs/)
TIP
Even though sparse throughput is given in TOPS instead of TFLOPS, it still uses
floating‑point math. The difference is purely in how many operations are actually
executed. TOPS count only the valuable operations - not the skipped operations.
NVIDIA has been preparing for this quantum future with its
CUDA Quantum (CUDA-Q) initiative which provides a unified
programming model for hybrid quantum-classical computing.
The basic idea is that tomorrow’s HPC clusters might have
quantum processing units (QPUs) sitting alongside GPUs and
CPUs - each tackling tasks for which they are uniquely
optimized.
It’s worth noting that you can currently use GPUs to simulate
quantum algorithms during development. NVIDIA’s cuQuantum
SDK, for instance, allows fast simulation of moderate-qubit
counts on GPUs. If someone designs an AI algorithm involving,
say, a quantum program that requires 30 qubits, we could
simulate those qubits on the GPU during development and
testing.
It’s not surprising that frontier research labs like xAI, OpenAI,
and Microsoft are reportedly planning large, single clusters of
100,000 to 1,000,000 GPUs. At a 100-trillion-parameter scale,
you might have one job spanning an entire datacenter’s worth
of hardware. Performance engineers must think at datacenter
and multi-datacenter (global) scale.
Key Takeaways
Unified Computing Engines.
AI‑Assisted Optimization.
3D Memory Integration.
Conclusion
AI Systems Performance Engineering is entering its golden era.
This is both challenging and exciting. The frontier of what’s
possible keeps expanding. Models are a hundred times larger. The
demand for real-time inference to power advanced reasoning
and AGI models is reaching an unprecedented scale. New
hardware paradigms continue to emerge, and AI optimizing AI
is a reality. The case studies we examined earlier each
highlighted how creative engineering turns ambitious ideas
into reality. Now, the emerging trends we’ve discussed will
carry those lessons forward into the future.
After all, every great leap in AI, from the earliest multi-layer
perceptrons (MLPs) to today’s massive Transformer-based GPT
and MoE models, has many unsung heroes making it run
efficiently and cost-effectively behind the scenes. In the era of
100-trillion-parameter models and beyond, you could be one of
those heroes and ensure that the next generation of AI is
powerful, efficient, sustainable, and brilliantly engineered. The
adventure is just beginning, and I, for one, can’t wait to see
what we accomplish next.
Chapter 7. AI Systems Performance
Checklist (175+ Items)
Topology-Aware Scheduling.
Always co-locate multi-GPU jobs within an NVLink Switch
domain if possible. Keeping all GPUs of a job on the NVL72
fabric means near-linear scaling for communication-
heavy workloads. Mixing GPUs across NVLink domains or
standard networks will introduce bottlenecks and should
be avoided for tightly-coupled tasks.
Future Expansion.
Be aware that NVLink Switch can scale beyond a single
rack – up to 576 GPUs in one connected domain via
second-level switches. If you operate at that ultra-scale,
plan hierarchical communication using local NVL72 inter-
rack collectives first, then use inter-rack interconnects
only when necessary. This helps to maximize intra-rack
NVLink usage first. This ensures you’re using the fastest
links before resorting to inter-rack InfiniBand hops.
Client-Side Caching.
Manual Prefetching.
Cooperative Groups.
Kernel Fusion.
Always load data ahead of the iteration that needs it. Use
background data loader threads or processes, such as
PyTorch DataLoader with prefetch_factor . Pin
memory on the host ( pin_memory=True ) so Host->Device
transfers are faster. For distributed training, use
DistributedSampler to ensure each process gets
unique data and to avoid redundant I/O (see the sketch at
the end of this checklist).
Compress Communication.
Use Quantization.
Consolidate Workloads.
Power Monitoring.
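Here is a minimal sketch of the DistributedSampler pattern referenced in the data-loading item above. It assumes the job is launched with torchrun so that RANK and WORLD_SIZE are set; the dataset is a stand-in.

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")    # launched via torchrun

dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))
sampler = DistributedSampler(dataset)      # gives each rank a disjoint shard of the data

loader = DataLoader(
    dataset,
    batch_size=256,
    sampler=sampler,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
)

for epoch in range(3):
    sampler.set_epoch(epoch)               # reshuffle the shards differently each epoch
    for features, labels in loader:
        pass                               # ... training step ...

dist.destroy_process_group()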