Compute Express Link Overview
2022-06-01
Intro
Compute Express Link (CXL) is the next spec of significance for connecting hardware devices. It will replace or supplement existing stalwarts like PCIe. Adoption is starting in the datacenter, and the specification definitely provides interesting possibilities for client and embedded devices. A few years ago, the picture wasn't so clear. The original release of the CXL specification wasn't Earth shattering. There were competing standards with intent hardware vendors behind them. The drive toward, and release of, the Compute Express Link 2.0 specification changed that.
There are a bunch of really great materials hosted by the CXL consortium. I find that these are primarily geared toward use cases, hardware vendors, and sales & marketing. This blog series will dissect CXL with the scalpel of a software engineer. I intend to provide an in-depth overview of the driver architecture, and go over how the development was done before there was hardware. This post will go over the important parts of the specification as I see them.

All spec references are relative to the 2.0 specification, which can be obtained here.
What it is
If one were creating a PCIe based memory expansion device today, and desired that the device expose coherent byte addressable memory, there would really be only two viable options. One could expose this memory as memory mapped input/output (MMIO) via a Base Address Register (BAR). Without horrendous hacks, and to have ubiquitous CPU support, the only sane way this can work is if you map the MMIO as uncached (UC), which has a distinct performance penalty. For more details on coherent memory access to the GPU, see my previous blog post. Access to the device memory should be fast (at least, not limited by the protocol's restrictions), and we haven't managed to accomplish that. In fact, the NVMe 1.4 specification introduced the Persistent Memory Region (PMR), which sort of does this but is still limited.
If one were creating a PCIe based device whose main job was to do Network Address Translation (NAT) (or some other IP packet mutation) that was to be done by the CPU, critical memory bandwidth would be needed for this. This is because the CPU will have to read data in from the device, modify it, and write it back out, and the only way to do this over PCIe is through main memory.
CXL defines 3 protocols that work on top of PCIe and enable (Chapter 3 of the CXL 2.0 specification) a general purpose way to implement the examples above. Two of these protocols help address our 'coherent but fast' problem above. I'll call them the data protocols. The third, CXL.io, can be thought of as a stricter set of requirements over PCIe config cycles. I'll call that the enumeration/configuration protocol. We'll not discuss that in any depth as it's not particularly interesting.
There are plenty of great overviews, such as this one. The point of this blog is to focus on the specific aspects that driver writers and reviewers will care about.
CXL.cache
But first, a bit on PCIe coherency. Modern x86 architectures have cache coherent PCIe DMA. For DMA reads this simply means that the DMA engine obtains the most recent copy of the data by requesting it from the fabric. For writes, once the DMA is complete the DMA engine will send an invalidation request to the host(s) to invalidate the range that was DMA'd. Fundamentally however, using this is generally not optimal since keeping coherency would require the CPU to basically snoop the PCIe interconnect all the time. This would be bad for power and performance. As such, drivers generally manage coherency via software mechanisms. There are exceptions to this rule, but I'm not familiar with them.
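For contrast, this is roughly what "manage coherency via software mechanisms" looks like today with the Linux streaming DMA API (a minimal sketch; the function name and the trimmed device-programming step are mine):

```c
#include <linux/dma-mapping.h>

/*
 * Sketch of software-managed coherency with the Linux streaming DMA API.
 * example_rx() and the surrounding assumptions (a pre-allocated buffer,
 * a device that DMAs into it) are illustrative only.
 */
static void example_rx(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* Hand the buffer to the device; the CPU must not touch it now. */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle))
		return;

	/* ... program the device to DMA into 'handle', wait for it ... */

	/* Make the device's writes visible to the CPU before reading. */
	dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

	/* ... CPU reads and processes buf ... */

	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}
```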
CXL.cache is interesting because it allows the device to participate in the CPU cache coherency protocol as if it were another CPU rather than a device. From a software perspective, it's the less interesting of the two data protocols. Chapter 3.2.x has a lot of words around what this protocol is for, and how it is designed to work.

This protocol is targeted towards accelerators which do not have anything to provide to the system in terms of resources, but instead utilize host-attached memory and a local cache. The CXL.cache protocol, if successfully negotiated throughout the CXL topology, host to endpoint, should just work. It permits the device to have a coherent view of memory without software intervention and, on x86, without the potential negative ramifications of the snoops. Similarly, the host is able to read from the device caches without using main memory as a stopping point. Main memory can be skipped and instead data can be transferred directly over the CXL.cache protocol. The protocol describes snoop filtering and the messages necessary to maintain coherency. As a software person, I consider it a more efficient version of PCIe coherency, and one which generalizes beyond x86 specifics.
Protocol
CXL.cache has a bidirectional request/response protocol where a request can be made from host to device (H2D) or vice-versa (D2H). The set of commands are what you'd expect to provide proper snooping. For example, H2D requests use one of the 3 Snp* opcodes defined in 3.2.4.3.X; these allow the host to gain exclusive access to a line, shared access, or just get the current value. The device uses one of the several commands in Table 18 to read/write/invalidate/flush (similar uses).
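As a rough mental model of those commands (the handful of names below loosely follow the spec's naming, but this is an illustrative subset, not the encodings or the full opcode tables in 3.2.4.x):

```c
/* Illustrative subset of CXL.cache opcodes; not the spec's encodings. */
enum cxl_cache_h2d_req {		/* host snooping the device cache */
	H2D_SNP_DATA,			/* hand back a shared copy of the line */
	H2D_SNP_INV,			/* invalidate it, host wants exclusive */
	H2D_SNP_CUR,			/* just report the current value */
};

enum cxl_cache_d2h_req {		/* device asking the host for a line */
	D2H_RD_SHARED,			/* readable (shared) copy */
	D2H_RD_OWN,			/* exclusive ownership, intent to write */
	D2H_CL_FLUSH,			/* ask the host to flush the line */
	D2H_DIRTY_EVICT,		/* write a modified line back */
};
```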
One might also notice that the peer to peer case isn't covered. The CXL model however makes every device/CPU a peer in the CXL.cache domain. While the current CXL specification doesn't address skipping CPU caches in this manner entirely, it'd be a safe bet to assume a specification so comprehensive would be getting there soon. CXL would allow this more generically than NVMe.

To summarize, CXL.cache essentially lets CPU and device caches remain coherent without using main memory as a synchronization point.
CXL.mem
If CXL.cache is for devices that don't provide resources, CXL.mem is exactly the opposite. CXL.mem allows the CPU to have coherent byte addressable access to device-attached memory while maintaining its own internal cache. Unlike CXL.cache, where every entity is a peer and the device or host sends requests and responses, the CPU, known as the "master" in the CXL spec, is responsible for sending requests, and the CXL subordinate (device) sends the response. Introduced in CXL 1.1, CXL.mem was added for Type 2 devices. Requests from the master to the subordinate are "M2S" and responses are "S2M".
When CXL.cache isn't also present, CXL.mem is very straightforward. All requests boil down to a read, or a write. When CXL.cache is present, the situation is more tricky. For performance improvements, the host will tell the subordinate about certain ranges of memory which may not need to handle coherency between the device cache, device-attached memory, and the host cache. There is also metadata passed along to inform the device about the current cacheline state. Both the master and subordinate need to keep their cache state in harmony.
Protocol
The CXL.mem protocol is straightforward, especially when the device doesn't also use CXL.cache (i.e. it has no local cache).
The requests:

- Req - Request without data. These are generally reads and invalidates, where the response will put the data on the S2M data channel.
- RwD - Request with data. These are generally writes, where the data channel has the data to write.

The responses:

- NDR - No Data Response. These are generally completions on state changes on the device side, such as a writeback completing.
- DRS - Data Response. This is used for returning data, i.e. on a read request from the master.
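Mapped onto a quick sketch (the channel names are the spec's; the enums themselves are just a mental model of the groupings above):

```c
/* CXL.mem message classes as described above; purely a mental model. */
enum cxl_mem_m2s {	/* master (CPU) -> subordinate (device) */
	M2S_REQ,	/* request without data: reads, invalidates */
	M2S_RWD,	/* request with data: writes */
};

enum cxl_mem_s2m {	/* subordinate (device) -> master (CPU) */
	S2M_NDR,	/* no data response: completions, state changes */
	S2M_DRS,	/* data response: read data comes back here */
};
```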
Bias controls
Unsurprisingly, strict coherency often negatively impacts bandwidth, or latency, or both. While it's generally ideal from a software model to be coherent, it likely won't be ideal for performance. The CXL specification has a solution for this. Chapter 2.2.1 describes a knob which provides a mechanism to hint at which entity should pay for that coherency (CPU vs. device).

For many HPC workloads, such as weather modeling, large sets of data are uploaded to an accelerator's device-attached memory via the CPU, then the accelerator crunches numbers on the data, and finally the CPU downloads the results. In CXL, at all times, the model data is coherent for both the device and the CPU. Depending on the bias however, one or the other will take a performance hit.

1. CPU writes data to device-attached memory.
2. GPU reads data from device-attached memory.
3. *GPU writes data to device-attached memory.
4. CPU reads data from device-attached memory.
*#3 above poses an interesting situation that was possible only with bespoke hardware. The GPU could in theory write that data out via CXL.cache and short-circuit another bias change. In practice however, many usages like this would exhaust the cache.

The CPU coherency engine has been a thing for a long time. One might ask, why not just use that and be done with it. Well, easy one first: a Device Coherency Engine (DCOH) was already required for supporting the CXL.cache protocol.
More practically however, the hit to latency and bandwidth is significant if every cacheline access needs to be checked against the CPU's coherency engine. What this means is that when the device wishes to access data from a line, it must first determine the cacheline state (DCOH can track this); if that line isn't exclusive to the accelerator, the accelerator needs to use the CXL.cache protocol to request the CPU make things coherent, and once that's done, the device-attached memory can be accessed. Why is that? If you recall, CXL.cache is essentially where the device is the initiator of the request, and CXL.mem is where the CPU is the initiator.
So suppose we continue on this 'CPU owns coherency' adventure. #1 looks great, the CPU can quickly upload the dataset. However, #2 will immediately hit the bottleneck just mentioned. Similarly for #3, even though a flush won't have to occur, the accelerator will still need to send a request to the CPU to make sure the line gets invalidated. To sum up, we have coherency, but half of our operations are slower than they need to be.
To address this, a [fairly vague] description of bias controls is defined. When in host bias mode, the CPU coherency engine effectively owns the cacheline state (the contents are shared of course) by requiring the device to use CXL.cache for coherency. In device bias mode however, the host will use CXL.mem commands to ensure coherency. This is why Type 2 devices need both CXL.cache and CXL.mem.
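Tying this back to the HPC workload above, the bias dance might look roughly like this at the driver level; cxl_set_bias() and the helpers are made-up names for illustration, since the spec defines the bias concept, not a software API:

```c
/* Hypothetical sketch only: none of these functions are real APIs. */
enum bias_mode { HOST_BIAS, DEVICE_BIAS };

static void run_hpc_job(struct mem_region *region, struct job *job)
{
	/* 1. CPU uploads the dataset: host owns coherency, uploads are cheap. */
	cxl_set_bias(region, HOST_BIAS);
	copy_dataset_to_device_memory(region, job);

	/* 2. & 3. Accelerator reads and writes device-attached memory without
	 * checking in with the host for every cacheline. */
	cxl_set_bias(region, DEVICE_BIAS);
	accelerator_crunch(job);

	/* 4. CPU downloads the results: flip back so host reads are cheap. */
	cxl_set_bias(region, HOST_BIAS);
	copy_results_from_device_memory(region, job);
}
```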
Device types
I'd like to know why they didn't start numbering at 0. I've already talked quite a bit about device types. I believe it made sense to define the protocols first though, so that the device types would make more sense. CXL 1.1 introduced two device types, and CXL 2.0 added a third. All types implement CXL.io, the less exciting protocol we're ignoring.
Just from looking at the table it'd be wise to ask: if Type 2 does both protocols, why do Type 1 and Type 3 devices exist? In short, gate savings can be had with Type 1 devices not needing CXL.mem, and Type 3 devices offer gate savings and increased performance because they don't have to manage internal cache coherency. More on that below.
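In code form, the split looks something like this (purely a restatement of the table, not anything from the spec or a driver):

```c
#include <stdbool.h>

/* Which protocols each CXL 2.0 device type speaks. */
struct cxl_device_type {
	const char *description;
	bool cxl_io;		/* always required */
	bool cxl_cache;
	bool cxl_mem;
};

static const struct cxl_device_type device_types[] = {
	{ "Type 1: accelerator without exposed memory (e.g. a NIC)", true, true,  false },
	{ "Type 2: accelerator with device-attached memory",         true, true,  true  },
	{ "Type 3: memory expander",                                  true, false, true  },
};
```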
The quintessential Type 1 device is the NIC. A NIC pushes data from memory out onto the wire, or pulls from the wire and into memory. It might perform many steps, such as repackaging a packet, or encryption, or reordering packets (I dunno, not a networking person). Our NAT example above is one such case.
How you might envision that working is the PCIe device would write the incoming packet into the Rx buffer. The CPU would copy that packet out of the Rx buffer, update the IP and port, then write it into the Tx buffer. This set of steps would use memory write bandwidth when the device wrote into the Rx buffer, memory read bandwidth when the CPU copied the packet, and memory write bandwidth when the CPU writes into the Tx buffer. Again, NVMe has a concept to support a subset of this case for peer to peer DMA called Controller Memory Buffers (CMB), but this is limited to NVMe based devices, and doesn't help with coherency on the CPU. Summarizing (D is device cache, M is memory, H is host/CPU cache):
1. (D2M) Device writes into Rx Queue
2. (M2H) Host copies out buffer
3. (H2M) Host writes into Tx Queue
4. (M2D) Device reads from Tx Queue
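A simplified picture of that pre-CXL hot loop from the CPU's point of view; every memcpy() below is main-memory bandwidth spent just shuttling the packet around (the struct and its fields are illustrative):

```c
#include <stdint.h>
#include <string.h>

struct pkt {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t  payload[1500];
};

/* Pre-CXL NAT: rx_slot and tx_slot live in host DRAM because that's the
 * only place the CPU and the PCIe device can coherently meet. */
static void nat_one_packet(const struct pkt *rx_slot, struct pkt *tx_slot,
			   uint32_t new_src_ip, uint16_t new_src_port)
{
	struct pkt p;

	memcpy(&p, rx_slot, sizeof(p));		/* memory read bandwidth  */
	p.src_ip = new_src_ip;			/* the actual useful work */
	p.src_port = new_src_port;
	memcpy(tx_slot, &p, sizeof(p));		/* memory write bandwidth */
}
```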
Post-CXL this becomes a matter of managing cache ownership throughout the pipeline. The NIC would write the incoming packet into the Rx buffer. The CPU would likely copy it out so as to prevent blocking future packets from coming in. Once done, the CPU has the buffer in its cache. The packet information could be mutated all in the cache, and then delivered to the Tx queue for sending out. Since the NIC may decide to mutate the packet further before going out, it'd issue the RdOwn opcode (3.2.4.1.7) and effectively own that line from then on.
1. (D2M) Device writes into Rx Queue
2. (M2H) Host copies out buffer
3. (H2D) Host transfers ownership into Tx Queue
With accelerators that don't have the possibility of causing backpressure like the Rx queue does, step 2
could be removed.
Type 2 devices are mandated to support both data protocols and as such, must implement their own DCOH engine (this will vary in complexity based on the complexity of the underlying device's cache hierarchy). One can think of this problem the same way as multiple CPUs where each has its own L1/L2, but a shared L3 (like Intel CPUs have, where L3 is the LLC). Each CPU needs to track transitions between its local L1/L2 and the global L3. TL;DR: for Type 2 devices, there's a relatively complex flow to manage local cache state on the device in relation to the host-attached memory it is using.
In a pre-CXL world, if a device wants to access its own memory, caches or no, it has the logic to do so. For example, in GPUs, the sampler generally has a cache. If you try to access texture data via the sampler that is already in the sampler cache, everything remains internal to the device. Similarly, if the CPU wishes to modify the texture, an explicit command to invalidate the GPU's sampler cache must be issued before it can be reliably used by the GPU (or flushed, if your GPU was modifying the texture).
Continuing with this example in the post-CXL world, the texture lives in graphics memory on the card, and that graphics memory is participating in the CXL.mem protocol. That would imply that should the CPU want to inspect, or worse, modify the texture, it can do so in a coherent fashion. Later, in Type 3 devices, we'll figure out how none of this needs to be that complex for memory expanders.
Type 3 devices are your memory modules. They provide memory capacity that's persistent, volatile, or a combination of the two.
Even though a Type 2 device could technically behave as a memory expander, it's not ideal to do so. The nature of a Type 2 device is that it has a cache which also needs to be maintained. Even with meticulous use of bias controls, extra invalidations and flushes will need to occur, and of course, extra gates are needed to handle this logic. The host CPU also has no way of knowing that a Type 2 device has no cache. To address this, the CXL 2.0 specification introduces a new type, Type 3, which is a "dumb" memory expander device. Since this device has no visible caches (because there is no accelerator), a reduced set of the CXL.mem protocol can be used, and the CPU will never need to snoop the device, which means the CPU's cache is the cache of truth. What this also implies is that a CXL Type 3 device simply provides device-attached memory to the system for any use. Hot plug is permitted. Type 3 peer to peer is absent from the 2.0 spec, and unlike CXL.cache, it's not as clear to see the path forward, since CXL.mem is a master/subordinate protocol.
In a pre-CXL world, the closest things you find to this are a combination of PCIe based NVMe devices (for persistent capacity), NVDIMM devices, and of course, attached DRAM. Generally, DRAM isn't available as expansion cards because a single DDR4 DIMM (which is internally dual channel) only has 21.6 GB/s of bandwidth. PCIe can keep up with that, but it requires all 16 lanes, which I guess isn't scalable, or cost effective, or something. But mostly, it's not a good use of DRAM when platform based interleaving can yield bandwidth in the hundreds of gigabytes per second.
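The arithmetic behind that claim, assuming the author meant PCIe 4.0 with 128b/130b encoding (the post doesn't say which generation, so treat the exact figures as approximate):

```c
#include <stdio.h>

int main(void)
{
	/* DDR4-2666 DIMM: 2666 MT/s across a 64-bit (8-byte) bus. */
	double dimm = 2666e6 * 8 / 1e9;			/* ~21.3 GB/s */
	/* One PCIe 4.0 lane: 16 GT/s with 128b/130b encoding. */
	double lane = 16e9 * (128.0 / 130.0) / 8 / 1e9;	/* ~1.97 GB/s */

	printf("DDR4 DIMM: %.1f GB/s\n", dimm);
	printf("PCIe x8  : %.1f GB/s (not enough)\n", lane * 8);
	printf("PCIe x16 : %.1f GB/s (keeps up)\n", lane * 16);
	return 0;
}
```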
In a post-CXL world the story changes in the sense that the OS is responsible for much of the configuration, and this is why Type 3 devices are the most interesting from a software perspective. Even though CXL currently runs on PCIe 5.0, CXL offers the ability to interleave across multiple devices, thus multiplying the bandwidth by the count of interleave ways. When you take PCIe 6.0 bandwidth and interleaving, CXL offers quite a robust alternative to HBM, and can even scale up to GPU-level memory bandwidth with DDR.
This would apply to Type 3 devices, but technically could also apply to Type 2 devices.
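To make the interleaving idea concrete, a simplified decode of which device services a given host physical address might look like this; real CXL HDM decoders pick specific address bits and granularities, so this is just a sketch of the concept:

```c
#include <stdint.h>

/* N-way interleave: consecutive granularity-sized chunks of the host
 * physical address range rotate across the devices, so sequential
 * accesses spread their bandwidth over all of them. */
static unsigned int target_device(uint64_t hpa, uint64_t range_base,
				  uint64_t granularity, unsigned int ways)
{
	return (unsigned int)(((hpa - range_base) / granularity) % ways);
}
```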
Even though the protocols and use-cases should be understood, the devil is in the details with software enabling. Type 1 and Type 2 devices will largely gain benefit just from hardware; perhaps some flows might need driver changes, i.e. reducing flushes and/or copies which wouldn't be needed.
Type 3 devices will need host physical address space allocated dynamically (it's not entirely unlike memory hotplug, but in some ways thornier). The devices will need to be programmed to accept those addresses. And last but not least, those devices will need to be maintained using a spec defined mailbox interface.

The next chapter will start in the same way the driver did, with the mailbox interface used for device information and configuration.
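As a teaser, a mailbox exchange conceptually boils down to something like the following; the struct below is an illustration of the idea, not the spec's register layout or any driver's actual types:

```c
#include <stdint.h>
#include <stddef.h>

/* Conceptual shape of a CXL device mailbox command: an opcode (e.g. the
 * spec's "Identify Memory Device"), optional input/output payloads, and a
 * return code once the device completes it. Field names are mine. */
struct mbox_cmd {
	uint16_t opcode;
	void	*payload_in;
	size_t	 size_in;
	void	*payload_out;
	size_t	 size_out;
	uint16_t return_code;
};
```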
Summary
- CXL.cache allows the CPU and device to operate uniformly on the host cache.
- Type 3 devices provide a subset of CXL.mem for use as memory expanders.
Related:
- https://www.computeexpresslink.org/download-the-specification
- https://www.computeexpresslink.org/resource-library
- https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches#
- https://en.wikipedia.org/wiki/Bus_snooping#Snoop_filter