Compute Express Link Overview

2022-06-01

Intro

Compute Express Link (CXL) is the next spec of significance for connecting hardware devices. It will replace or supplement existing stalwarts like PCIe. The adoption is starting in the datacenter, and the specification definitely provides interesting possibilities for client and embedded devices. A few years ago, the picture wasn't so clear. The original release of the CXL specification wasn't Earth shattering. There were competing standards with intent hardware vendors behind them. The drive to, and release of, the Compute Express Link 2.0 specification changed much of that.


There are a bunch of really great materials hosted by the CXL consortium. I find that these are primarily geared toward use cases, hardware vendors, and sales & marketing. This blog series will dissect CXL with the scalpel of a software engineer. I intend to provide an in-depth overview of the driver architecture, and go over how the development was done before there was hardware. This post will go over the important parts of the specification as I see them.


All spec references are relative to the 2.0 specification which can be obtained here.


What it is

Let's start with two practical examples.


If one were creating a PCIe based memory expansion device today, and desired that device expose coherent byte addressable memory, there would really be only two viable options. One could expose this memory mapped input/output (MMIO) via a Base Address Register (BAR). Without horrendous hacks, and to have ubiquitous CPU support, the only sane way this can work is if you map the MMIO as uncached (UC), which has a distinct performance penalty. For more details on coherent memory access to the GPU, see my previous blog post. Access to the device memory should be fast (at least, not limited by the protocol's restrictions), and we haven't managed to accomplish that. In fact, the NVMe 1.4 specification introduces the Persistent Memory Region (PMR) which sort of does this but is still limited.
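To make that pre-CXL option concrete, here's a minimal sketch of how a Linux driver might expose such a BAR as uncached MMIO. The driver name, probe function, and the choice of BAR 2 are mine, not anything from the spec or the post; the point is simply that pcim_iomap() hands back an uncached mapping on x86, so every CPU access pays the UC penalty described above.

    #include <linux/module.h>
    #include <linux/pci.h>
    #include <linux/io.h>

    /* Illustrative only: a pre-CXL "memory expander" exposing capacity
     * through BAR 2. The mapping is uncached (UC), so it is coherent with
     * the CPU but slow, exactly the trade-off described above. */
    static int memx_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        void __iomem *mem;
        int ret;

        ret = pcim_enable_device(pdev);
        if (ret)
            return ret;

        /* Map BAR 2 as UC MMIO; every CPU load/store goes to the device. */
        mem = pcim_iomap(pdev, 2, 0);
        if (!mem)
            return -ENOMEM;

        writel(0xdeadbeef, mem);
        dev_info(&pdev->dev, "readback: %#x\n", readl(mem));
        return 0;
    }

Only the probe path is shown; the usual pci_driver registration boilerplate is omitted for brevity.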


If one were creating a PCIe based device whose main job was to do Network Address Translation (NAT) (or some other IP packet mutation) that was to be done by the CPU, critical memory bandwidth would be needed for this. This is because the CPU will have to read data in from the device, modify it, and write it back out, and the only way to do this via PCIe is to go through main memory. (More on this below.)
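A rough sketch of that CPU-driven NAT loop follows. The ring helpers (rx_ring_next(), tx_ring_push()) and the fixed header offsets are hypothetical stand-ins of my own, but the shape is the point: every packet crosses main memory on the way in and again on the way out before the NIC ever sees it.

    #include <stdint.h>
    #include <string.h>

    struct pkt { uint8_t data[2048]; uint16_t len; };

    /* Hypothetical ring helpers; assume the NIC DMAs received frames into
     * rx_ring (host DRAM) and reads frames to transmit from tx_ring. */
    struct pkt *rx_ring_next(void);
    void tx_ring_push(const struct pkt *p);

    static void nat_one_packet(uint32_t new_saddr, uint16_t new_sport)
    {
        struct pkt *rx = rx_ring_next();   /* consumes memory read bandwidth  */
        struct pkt tx;

        memcpy(&tx, rx, sizeof(tx));       /* copy out of the Rx buffer       */

        /* Rewrite source IP/port at fixed offsets for a plain IPv4/UDP frame
         * (both values assumed to be in network byte order); real NAT code
         * would parse headers and fix checksums as well. */
        memcpy(tx.data + 26, &new_saddr, 4);
        memcpy(tx.data + 34, &new_sport, 2);

        tx_ring_push(&tx);                 /* consumes memory write bandwidth */
    }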


CXL defines 3 protocols that work on top of PCIe that enable (Chapter 3 of the CXL 2.0 specification) a general purpose way to implement the examples. Two of these protocols help address our 'coherent but fast' problem above. I'll call them the data protocols. The third, CXL.io, can be thought of as a stricter set of requirements over PCIe config cycles. I'll call that the enumeration/configuration protocol. We'll not discuss that in any depth as it's not particularly interesting.


There are plenty of great overviews, such as this one. The point of this blog is to focus on the specific aspects driver writers and reviewers might care about.


CXL.cache

But first, a bit on PCIe coherency. Modern x86 architectures have cache coherent PCIe DMA. For DMA reads this simply means that the DMA engine obtains the most recent copy of the data by requesting it from the fabric. For writes, once the DMA is complete the DMA engine will send an invalidation request to the host(s) to invalidate the range that was DMA'd. Fundamentally however, using this is generally not optimal since keeping coherency would require the CPU to basically snoop the PCIe interconnect all the time. This would be bad for power and performance. As such, drivers generally manage coherency via software mechanisms. There are exceptions to this rule, but I'm not intimately familiar with them, so I won't add more detail.


CXL.cache is interesting because it allows the device to participate in the CPU cache coherency protocol as if it were another CPU rather than being a device. From a software perspective, it's the less interesting of the two data protocols. Chapter 3.2.x has a lot of words around what this protocol is for, and how it is designed to work.


This protocol is targeted towards accelerators which do not have anything to provide to the system in terms of resources, but instead utilize host-attached memory and a local cache. The CXL.cache protocol, if successfully negotiated throughout the CXL topology, host to endpoint, should just work. It permits the device to have a coherent view of memory without software intervention and, on x86, without the potential negative ramifications of the snoops. Similarly, the host is able to read from the device caches without using main memory as a stopping point. Main memory can be skipped and instead data can be directly transferred over the CXL.cache protocol. The protocol describes snoop filtering and the necessary messages to keep coherency. As a software person, I consider it a more efficient version of PCIe coherency, and one which transcends x86 specificity.


Protocol

CXL.cache has a bidirectional request/response protocol where a request can be made from host to device (H2D) or vice-versa (D2H). The set of commands is what you'd expect to provide proper snooping. For example, the host issues H2D requests with one of the 3 Snp* opcodes defined in 3.2.4.3.X; these allow it to gain exclusive access to a line, shared access, or just get the current value, while the device uses one of several commands in Table 18 to read/write/invalidate/flush (similar uses).
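As a mental model only (not the actual spec encodings or flit formats), a device-side coherency engine reacting to those three H2D snoop classes might look like the sketch below. The opcode names follow the spec's Snp* naming as I read it (SnpData, SnpInv, SnpCur), but the cache-state handling is a simplified MESI-style illustration of my own rather than what any real DCOH does.

    #include <stdbool.h>
    #include <string.h>

    /* Simplified cacheline states and H2D snoop classes (illustrative). */
    enum cl_state { CL_INVALID, CL_SHARED, CL_EXCLUSIVE, CL_MODIFIED };
    enum h2d_snoop { SNP_DATA, SNP_INV, SNP_CUR };

    struct cacheline {
        enum cl_state state;
        unsigned char data[64];
    };

    /* Conceptual handling of a host snoop against a line the device caches
     * under CXL.cache. Real hardware returns specific response messages;
     * here we only model the state transition and the dirty-data return. */
    static bool handle_h2d_snoop(struct cacheline *cl, enum h2d_snoop op,
                                 unsigned char *data_out)
    {
        bool has_dirty_data = (cl->state == CL_MODIFIED);

        if (has_dirty_data)
            memcpy(data_out, cl->data, sizeof(cl->data));

        switch (op) {
        case SNP_DATA:  /* host wants a shareable copy */
            cl->state = CL_SHARED;
            break;
        case SNP_INV:   /* host wants exclusive ownership */
            cl->state = CL_INVALID;
            break;
        case SNP_CUR:   /* host only wants the current value */
            /* device may keep its state as-is */
            break;
        }
        return has_dirty_data;
    }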


One might also notice that the peer to peer case isn't covered. The CXL model however makes every device/CPU a peer in the CXL.cache domain. While the current CXL specification doesn't address skipping CPU caches in this manner entirely, it'd be a safe bet to assume a specification so comprehensive would be getting there soon. CXL would allow this more generically than NVMe.


To summarize, CXL.cache essentially lets CPU and device caches remain coherent without needing to use main memory as the synchronization barrier.


CXL.mem

If CXL.cache is for devices that don't provide resources, CXL.mem is exactly the opposite. CXL.mem allows the CPU to have coherent byte addressable access to device-attached memory while maintaining its own internal cache. Unlike CXL.cache, where every entity is a peer and the device or host sends requests and responses, here the CPU, known as the "master" in the CXL spec, is responsible for sending requests, and the CXL subordinate (device) sends the response. Introduced in CXL 1.1, CXL.mem was added for Type 2 devices. Requests from the master to the subordinate are "M2S" and responses are "S2M".


When CXL.cache isn't also present, CXL.mem is very straightforward. All requests boil down to a read, or a write. When CXL.cache is present, the situation is more tricky. For performance improvements, the host will tell the subordinate about certain ranges of memory which may not need to handle coherency between the device cache, device-attached memory, and the host cache. There is also meta data passed along to inform the device about the current cacheline state. Both the master and subordinate need to keep their cache state in harmony.


Protocol

The CXL.mem protocol is straightforward, especially when the device doesn't also use CXL.cache (ie. it has no local cache).

The requests are known as:


Req - Request without data. These are generally reads and invalidates, where the response will put the data on the S2M's data channel.

RwD - Request with data. These are generally writes, where the data channel has the data to write.

The responses:

NDR - No Data Response. These are generally completions on state changes on the device side, such as writeback complete.

DRS - Data Response. This is used for returning data, ie. on a read request from the master.
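Put together: in the simple (no CXL.cache) case, a read is a Req on M2S answered with data on S2M (DRS), and a write is an RwD answered with a completion (NDR). The enum below is just my shorthand for that taxonomy, not the spec's actual message encodings.

    /* Illustrative shorthand for the CXL.mem message classes above;
     * these are not the real opcodes or flit encodings. */
    enum m2s_class {
        M2S_REQ,    /* request without data: reads, invalidates */
        M2S_RWD,    /* request with data: writes                */
    };

    enum s2m_class {
        S2M_NDR,    /* no-data response: completions            */
        S2M_DRS,    /* data response: read data                 */
    };

    /* Expected pairing for the simple, cache-less case. */
    static enum s2m_class expected_response(enum m2s_class req)
    {
        return (req == M2S_REQ) ? S2M_DRS : S2M_NDR;
    }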

Bias controls

Unsurprisingly, strict coherency often negatively impacts bandwidth, or latency, or both. While it's generally ideal from a software model to be coherent, it likely won't be ideal for performance. The CXL specification has a solution for this. Chapter 2.2.1 describes a knob which allows a mechanism to provide the hint over which entity should pay for that coherency (CPU vs. device).

For many HPC workloads, such as weather modeling, large sets of data are uploaded to an accelerator's device-attached memory via the CPU, then the accelerator crunches numbers on the data, and finally the CPU downloads the results. In CXL, at all times, the model data is coherent for both the device and the CPU. Depending on the bias however, one or the other will take a performance hit.


Host vs. device bias

Using the weather modeling example, there are 4 interesting flows.


1. CPU writes data to device-attached memory.
2. GPU reads data from device-attached memory.
3. GPU writes data to device-attached memory.
4. CPU reads data from device-attached memory.

#3 above poses an interesting situation that was possible only with bespoke hardware. The GPU could in theory write that data out via CXL.cache and short-circuit another bias change. In practice though, many such usages would blow out the cache.


The CPU coherency engine has been a thing for a long time. One might ask, why not just use that and be done with it. Well, easy one first: a Device Coherency Engine (DCOH) was already required for CXL.cache protocol support. More practically however, the hit to latency and bandwidth is significant if every cacheline access required a check-in with the CPU's coherency engine.

What this means is that when the device wishes to access data from this line, it must first determine the cacheline state (DCOH can track this). If that line isn't exclusive to the accelerator, the accelerator needs to use the CXL.cache protocol to request the CPU make things coherent, and once complete, it can then access its device-attached memory. Why is that? If you recall, CXL.cache is essentially where the device is the initiator of the request, and CXL.mem is where the CPU is the initiator.


So suppose we continue on this CPU-owns-coherency adventure. #1 looks great, the CPU can quickly upload the dataset. However, #2 will immediately hit the bottleneck just mentioned. Similarly for #3, even though a flush won't have to occur, the accelerator will still need to send a request to the CPU to make sure the line gets invalidated. To sum up, we have coherency, but half of our operations are slower than they need to be.


To address this, a [fairly vague] description of bias controls is defined. When in host bias mode, the CPU coherency engine effectively owns the cacheline state (the contents are shared of course) by requiring the device to use CXL.cache for coherency. In device bias mode however, the host will use CXL.mem commands to ensure coherency. This is why Type 2 devices need both CXL.cache and CXL.mem.


Device types

I'd like to know why they didn't start numbering at 0. I've already talked quite a bit about device types. I believe it made sense to define the protocols first though, so that device types would make more sense. CXL 1.1 introduced two device types, and CXL 2.0 added a third. All types implement CXL.io, the less than exciting protocol we ignore.


Just from looking at the table it'd be wise to ask, if Type 2 does both protocols, why do Type 1 and Type 3 devices exist. In short, gate savings can be had with Type 1 devices not needing CXL.mem, and Type 3 devices offer gate savings and increased performance because they don't have to manage internal cache coherency. More on this next...


CXL Type 1 Devices

These are your accelerators without local memory.


The quintessential Type 1 device is the NIC. A NIC pushes data from memory out onto the wire, or pulls from the wire and into memory. It might perform many steps, such as repackaging a packet, or encryption, or reordering packets (I dunno, not a networking person). Our NAT example above is one such case.


How you might envision that working is the PCIe device would write the incoming packet into the Rx buffer. The CPU would copy that packet out of the Rx buffer, update the IP and port, then write it into the Tx buffer. This set of steps would use memory write bandwidth when the device wrote into the Rx buffer, memory read bandwidth when the CPU copied the packet, and memory write bandwidth when the CPU writes into the Tx buffer. Again, NVMe has a concept to support a subset of this case for peer to peer DMA called Controller Memory Buffers (CMB), but this is limited to NVMe based devices, and doesn't help with coherency on the CPU. Summarizing (D is device cache, M is memory, H is host/CPU cache):


(D2M) Device writes into Rx Queue
(M2H) Host copies out buffer
(H2M) Host writes into Tx Queue
(M2D) Device reads from Tx Queue

Post-CXL this becomes a matter of managing cache ownership throughout the pipeline. The NIC would write the incoming packet into the Rx buffer. The CPU would likely copy it out so as to prevent blocking future packets from coming in. Once done, the CPU has the buffer in its cache. The packet information could be mutated all in the cache, and then delivered to the Tx queue for sending out. Since the NIC may decide to mutate the packet further before going out, it'd issue the RdOwn opcode (3.2.4.1.7), from which point it would effectively own that cacheline.


(D2M) Device writes into Rx Queue
(M2H) Host copies out buffer
(H2D) Host transfers ownership into Tx Queue

With accelerators that don't have the possibility of causing backpressure like the Rx queue does, step 2 could be removed.


CXL Type 2 Devices

These are your accelerators with local memory.


Type 2 devices are mandated to support both data protocols and as such, must implement their own DCOH engine (this will vary in complexity based on the underlying device's cache hierarchy complexity). One can think of this problem the same way as multiple CPUs where each has their own L1/L2, but a shared L3 (like Intel CPUs have, where L3 is LLC). Each CPU would need to track transitions between its local L1 and L2, and from there to the global L3. TL;DR on this is, for Type 2 devices, there's a relatively complex flow to manage local cache state on the device in relation to the host-attached memory they are using.


In a pre-CXL world, if a device wants to access its own memory, caches or no, it would have the logic to do so. For example, in GPUs, the sampler generally has a cache. If you try to access texture data via the sampler that is already in the sampler cache, everything remains internal to the device. Similarly, if the CPU wishes to modify the texture, an explicit command to invalidate the GPU's sampler cache must be issued before it can be reliably used by the GPU (or flushed if your GPU was modifying the texture).


Continuing with this example in the post-CXL world, the texture lives in graphics memory on the card, and that graphics memory is participating in the CXL.mem protocol. That would imply that should the CPU want to inspect, or worse, modify the texture, it can do so in a coherent fashion. Later in Type 3 devices, we'll figure out how none of this needs to be that complex for memory expanders.


CXL Type 3 Devices

These are your memory modules. They provide memory capacity that's persistent, volatile, or a combination.

Even though a Type 2 device could technically behave as a memory expander, it's not ideal to do so. The nature of a Type 2 device is that it has a cache which also needs to be maintained. Even with meticulous use of bias controls, extra invalidations and flushes will need to occur, and of course, extra gates are needed to handle this logic. The host CPU does not know a Type 2 device has no cache. To address this, the CXL 2.0 specification introduces a new type, Type 3, which is a "dumb" memory expander device. Since this device has no visible caches (because there is no accelerator), a reduced set of the CXL.mem protocol can be used; the CPU will never need to snoop the device, which means the CPU's cache is the cache of truth. What this also implies is a CXL Type 3 device simply provides device-attached memory to the system for any use. Hotplug is permitted.

Type 3 peer to peer is absent from the 2.0 spec, and unlike CXL.cache, it's not as clear to see the path forward because CXL.mem is a Master/Subordinate protocol.


In a pre-CXL world the closest thing you find to this is a combination of PCIe based NVMe devices (for persistent capacity), NVDIMM devices, and of course, attached DRAM. Generally, DRAM isn't available as expansion cards because a single DDR4 DIMM (which is internally dual channel) only has 21.6 GB/s of bandwidth. PCIe can keep up with that, but it requires all 16 lanes, which I guess isn't scalable, or cost effective, or something. But mostly, it's not a good use of DRAM when the platform based interleaving can yield bandwidth in the hundreds of gigabytes per second.


In a post-CXL world the story is changed in the sense that the OS is responsible for much of the configuration, and this is why Type 3 devices are the most interesting from a software perspective. Even though CXL currently runs on PCIe 5.0, CXL offers the ability to interleave across multiple devices, thus increasing the bandwidth in multiples by count of the interleave ways. When you take PCIe 6.0 bandwidth, and interleaving, CXL offers quite a robust alternative to HBM, and can even scale to GPU level memory bandwidth with DDR.


Host physical address space management

This would apply to Type 3 devices, but technically could also apply to Type 2 devices.


Even though the protocols and use-cases should be understood, the devil is in the details with software enabling. Type 1 and Type 2 devices will largely gain benefit just from hardware; perhaps some flows might need driver changes, ie. reducing flushes and/or copies which wouldn't be needed. Type 3 devices on the other hand are a whole new ball of wax.


Type 3 devices will need host physical address space allocated dynamically (it's not entirely unlike memory hot plug, but it is trickier in some ways). The devices will need to be programmed to accept those addresses. And last but not least, those devices will need to be maintained using a spec-defined mailbox interface.
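As a preview of that mailbox interface, the general shape is: write an opcode and payload into the mailbox registers, ring the doorbell, poll for completion, then read back the status and payload. The register offsets, names, and accessors below are placeholders of my own; the real layout comes from the CXL 2.0 register definitions, and the next post will walk through the actual driver code.

    #include <stdint.h>
    #include <stddef.h>

    /* Placeholder register offsets within the device's mailbox block;
     * the actual offsets and fields are defined by the CXL 2.0 spec. */
    #define MBOX_CMD       0x08    /* opcode + payload length */
    #define MBOX_STATUS    0x10    /* return code             */
    #define MBOX_CTRL      0x18    /* doorbell bit            */
    #define MBOX_PAYLOAD   0x20    /* payload area            */
    #define MBOX_DOORBELL  (1u << 0)

    /* Hypothetical MMIO accessors into the mapped mailbox registers. */
    uint64_t mbox_read(size_t off);
    void mbox_write(size_t off, uint64_t val);
    void mbox_copy_in(size_t off, const void *buf, size_t len);
    void mbox_copy_out(size_t off, void *buf, size_t len);

    static int mbox_cmd(uint16_t opcode, const void *in, size_t in_len,
                        void *out, size_t out_len)
    {
        /* Wait for the doorbell to be clear, i.e. mailbox idle. */
        while (mbox_read(MBOX_CTRL) & MBOX_DOORBELL)
            ;

        mbox_copy_in(MBOX_PAYLOAD, in, in_len);
        mbox_write(MBOX_CMD, (uint64_t)in_len << 16 | opcode);
        mbox_write(MBOX_CTRL, MBOX_DOORBELL);   /* ring the doorbell */

        /* Completion is signaled by the doorbell clearing again. */
        while (mbox_read(MBOX_CTRL) & MBOX_DOORBELL)
            ;

        if (mbox_read(MBOX_STATUS) != 0)
            return -1;                          /* command failed */

        mbox_copy_out(MBOX_PAYLOAD, out, out_len);
        return 0;
    }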


The next chapter will start in the same way the driver did: with the mailbox interface used for device information and configuration.


Summary

Important takeaways are as follows:


CXL.cache allows CPUs and devices to operate on host caches uniformly.


CXL.mem allows devices to export their memory coherently.


Bias controls mitigate performance penalties of CXL.mem coherency.


Type 3 devices provide a subset of CXL.mem, for memory expanders.


Related:

- https://www.computeexpresslink.org/download-the-specification
- https://www.computeexpresslink.org/resource-library
- https://developers.redhat.com/blog/2016/03/01/reducing-memory-access-times-with-caches#
- https://en.wikipedia.org/wiki/Bus_snooping#Snoop_filter
