Zstd Codec on the GPU #2863

akshaysubr · 2025-02-25T16:15:20Z

This PR adds a Zstd codec that runs on the GPU using the nvCOMP 4.2 python APIs.

TODO:

Make fully async
Performance benchmarking
CPU-GPU roundtrip testing
Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

src/zarr/codecs/gpu.py

dstansby · 2025-02-28T09:09:22Z

Thanks for opening this PR! At the moment we do not have any codecs implemented in the zarr-python package, but instead store them in numcodecs. So although it looks like there are some zarr-python specific changes that are needed to support the new GPU codec, the actual codec should be implemented in numcodecs, and then imported in zarr-python.

akshaysubr · 2025-02-28T16:00:06Z

@dstansby My understanding was that numcodecs is the place that python bindings to native codec implementations live and that with v3, the Codec class itself lives in zarr-python. The GPU codecs and python bindings are implemented in nvCOMP and imported through the nvidia-nvcomp-cu12 python package so I'm not sure which part of this would need to go in numcodecs. What did you have in mind?

src/zarr/codecs/gpu.py

…hread

TomAugspurger

Looks nice overall. I think the async side of things ended up in a pretty good spot. The code itself is pretty easy to follow (a normal stream synchronize). Having to do that on another host thread is a bit unfortunate, but there's only one synchronize per batch so this should be fine.

I left comments on a few things to clean up that I can help out with if you want @akshaysubr.

src/zarr/codecs/gpu.py

TomAugspurger · 2025-07-07T11:38:18Z

src/zarr/codecs/gpu.py

+    checksum: bool = False
+
+    def __init__(self, *, level: int = 0, checksum: bool = False) -> None:
+        # TODO: Set CUDA device appropriately here and also set CUDA stream


Agreed with leaving devices / streams as a TODO for now.

I want to enable users to overlap host-to-device memcpys with compute operations (like decode, but their own compute operations as well), but I'm not sure yet what that API will look like.

If you have any thoughts on how best to do this I'd love to hear them, and write them up as an issue.

#3271 for planning on devices and streams.

TomAugspurger · 2025-07-07T11:41:03Z

src/zarr/codecs/gpu.py

+        chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
+    ) -> Iterable[Buffer | None]:
+        return [
+            spec.prototype.buffer.from_array_like(cp.array(a, dtype=np.dtype("b"), copy=False))


This is one spot where @weiji14's idea to use dlpack in #2658 (comment) would help. If NDBuffer new how to consume objects implementing the dlpack protocol, we could (maybe) get rid of the cp.array

src/zarr/codecs/gpu.py

TomAugspurger · 2025-07-09T19:30:53Z

Fixed some merge conflicts and changed the _convert_{to,from}_nvcomp_arrays methods to be sync.

TomAugspurger · 2025-07-09T19:38:38Z

src/zarr/core/buffer/gpu.py

@@ -59,7 +59,7 @@ def __init__(self, array_like: ArrayLike) -> None:

        if array_like.ndim != 1:
            raise ValueError("array_like: only 1-dim allowed")
-        if array_like.dtype != np.dtype("B"):
+        if array_like.dtype.itemsize != 1:


The new tests in test_nvcomp.py were failing without this change.

I'd like to get us to a point where we don't care as much about the details of the buffer passed in here. This is an OK start I think.

what exactly does this check for? It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here

and same for the dimensionality check, since any N-dimensional numpy array can be viewed as a 1D array.

Yeah, I'm not really sure...

I agree that the actual data we store internally here needs to be a byte dtype. Just doing cp.asarray(input).view("b") seems pretty reasonable to me.

i'm not even convinced that we need Buffer / NDBuffer, when Buffer is just a special case of NDBuffer where there's 1 dimension and the data type is bytes

we could even express this formally by:

making NDBuffer generic with two type parameters (number of dimensions and dtype)

having APIs that insist on consuming a Buffer instead insist on consuming NDBuffer[Literal[1], np.uint8]

(super out of scope for this PR ofc)

It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here

I think this is mainly because NDBuffer objects don't need to be contiguous, but Buffer objects must be contiguous in memory which might be important when we send those out to codecs that expect a contiguous memory slice.

But I agree that we can probably merge those two and make Buffer a specialization of NDBuffer.

codecov · 2025-07-09T19:59:31Z

Codecov Report

❌ Patch coverage is 58.02469% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.68%. Comparing base (a0c56fb) to head (090349c).

Files with missing lines	Patch %	Lines
src/zarr/codecs/gpu.py	54.92%	32 Missing ⚠️
src/zarr/codecs/__init__.py	0.00%	1 Missing ⚠️
src/zarr/core/buffer/gpu.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2863      +/-   ##
==========================================
- Coverage   60.68%   60.68%   -0.01%     
==========================================
  Files          78       79       +1     
  Lines        9356     9424      +68     
==========================================
+ Hits         5678     5719      +41     
- Misses       3678     3705      +27

Files with missing lines	Coverage Δ
src/zarr/core/array.py	`69.71% <100.00%> (+0.07%)`	⬆️
src/zarr/core/config.py	`29.16% <ø> (ø)`
src/zarr/codecs/__init__.py	`0.00% <0.00%> (ø)`
src/zarr/core/buffer/gpu.py	`46.05% <87.50%> (+4.28%)`	⬆️
src/zarr/codecs/gpu.py	`54.92% <54.92%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

TomAugspurger

Thanks @akshaysubr, I think this is in a good state.

I'll open up a follow up issue to discuss how to handle devices and streams. I'll leave this open for a few days if anyone is interested in taking a look, and we can hopefully merge it next week.

dstansby

Thanks for the really nice work here! I'm afraid I'm going to mark this as request changes, because we still need to work out what our policy on including new codecs is - should they be included in numcodecs, or should they be included directly in zarr-python. There's a few reason we need to work this out before adding new codecs:

For developers, make it clear where to contribute new codecs (I know someone interested in implementing blosc2 at the moment for example)
For users, we need a clear story/documentation about where codecs live. If the answer is across both numcodecs and zarr-python, then we need clear cross linking documentation to make sure this is clear.

I've opened an issue at #3272 to discuss this; once that's resolved we can either continue here or move over to numcodecs.

I left some comments inline - might be worth waiting for the above to be resolved before looking at them?

Finally, this should also get some documentation in the user guide about how to swap between using the CPU and GPU when both are an option.

thanks again for the work - it's super appreciated, and I'm sorry that merging is going to be a bit delayed until we work out what our long term plan with implementing new codecs is.

src/zarr/codecs/gpu.py

dstansby · 2025-07-21T09:43:11Z

src/zarr/core/array.py

    """
-    return (ZstdCodec(),)
+    return (cast(BytesBytesCodec, get_codec_class("zstd")()),)


Why is the extra cast needed now?

get_codec_class returns type[Codec] but this function specifically returns a tuple[BytesBytesCodec].

src/zarr/core/buffer/gpu.py

TomAugspurger · 2025-07-21T11:53:48Z

Akshay responded on this point in February(!) @dstansby could you at least respond to that directly: that this is a wrapper around a codec engine, in exactly the same way that https://github.com/zarr-developers/zarr-python/blob/abbdbf2be70a24e7d662b5ed449c68f6718977f9/src/zarr/codecs/zstd.py is?

Finally, this should also get some documentation in the user guide about how to swap between using the CPU and GPU when both are an option.

That's already included in gpu.rst.

d-v-b · 2025-07-21T13:37:11Z

where is the source code for the nvcomp zstd implementation, and the python bindings?

TomAugspurger · 2025-07-21T13:44:25Z

where is the source code for the nvcomp zstd implementation, and the python bindings?

I don't believe that the source code for those are published publicly.

d-v-b · 2025-07-21T13:46:35Z

where is the source code for the nvcomp zstd implementation, and the python bindings?

I don't believe that the source code for those are published publicly.

That's potentially quite problematic. We recently had problems relating to skew across zstd implementations. If we cannot inspect the source code for this codec, and we cannot submit patches, then I would definitely not be interested in experiencing bugs from it.

TomAugspurger · 2025-07-21T13:53:46Z

If we cannot inspect the source code for this codec, and we cannot submit patches, then I would definitely not be interested in experiencing bugs from it.

Is that you speaking as a user of zarr or a maintainer?

From the user perspective, this will only be used if you activate it.

From the maintainer's perspective, I'd hope that through documentation and clear error messages we can pinpoint the issue for users.

dstansby · 2025-07-21T14:16:38Z

Akshay responded on this point #2863 (comment) @dstansby could you at least respond to that directly: that this is a wrapper around a codec engine, in exactly the same way that abbdbf2/src/zarr/codecs/zstd.py is?

Sorry, I missed that. And sorry for the slow reply - I am not paid to work on zarr-python, and review stuff on a best efforts basis.

re. where stuff lives, I did not realise that we had codec classes implemented directly in zarr-python. Indeed, our migration guide says (!!)

zarr.codecs has gone, use numcodecs instead

Given that codecs classes are in zarr-python already, and given the direction of conversation in #3272, I'm happy to put this in zarr-python.

That's already included in gpu.rst.

Am I right in thinking that GPU arrays and now decompression are a global option, not a per-array or a per-operation configuration? If so it would be good to clarify in gpu.rst that a) this is only a global option b) if it can be changed during a python session, and c) how decompression of GPU buffers is handled for codecs without a GPU implementation (I presume it just falls back to the CPU impelementation)

re. @d-v-b's point about about skew across zstd implementations, I think this is a very important point. I'm personally on the fence about depending on a closed source library, but as a minimum I think there should be tests that compare the compression/decompression with the (open source I presume?) CPU implementation and the new GPU implementation, and make sure the results are the same.

d-v-b · 2025-07-21T14:23:27Z

Is that you speaking as a user of zarr or a maintainer?

Both? Zarr users and maintainers routinely expose bugs in libraries we depend on, and we routinely forward issues to those libraries, and occasionally contribute patches. If users report a bug or problem in nvcomp, what do we do? Suppose you or akshay leave nvidia, should I email support@nvidia.com when we have problems?

TomAugspurger · 2025-07-21T21:56:15Z

Sorry, I missed that. And sorry for the slow reply - I am not paid to work on zarr-python, and review stuff on a best efforts basis.

Likewise :/

Am I right in thinking that GPU arrays and now decompression are a global option, not a per-array or a per-operation configuration?

I'm not sure... In AsyncArray.getitem there is a prototype argument that seems to be used for the buffers. I'm not sure whether the intent is for that to control all intermediate buffers or just the user-facing output buffer. The whole codec pipeline is still a bit of a black box to me.

I think that https://zarr.readthedocs.io/en/stable/user-guide/config.html is the relevant documentation here:

For selecting custom implementations of codecs, pipelines, buffers and ndbuffers, first register the implementations in the registry and then select them in the config.

This isn't adding any new concepts: just a new implementation that uses the existing configuration system (which I think is fair to summarize as a global managed through our configuration system).

if it can be changed during a python session,

Sure... It's the same as any other configuration option. We don't explicitly show using any option in a context manager, but we do link to https://github.com/pytroll/donfig, which documents that behavior. I can duplicate that in our documentation if you want.

and c) how decompression of GPU buffers is handled for codecs without a GPU implementation (I presume it just falls back to the CPU implementation)

Yeah, it's following the documented behavior: the codec class associated with a given codec ID is used. We can repeat that in the GPU documentation and cross-reference the configuration docs.

So from a generic codec side of things, it'll just use whatever codec implementation is associated with that codec name. From a GPU-specific side of things, we'll want to document which codecs are currently implemented (I imagine the remaining codecs will be a relatively straightforward refactor based on top of what this branch implements).

If users report a bug or problem in nvcomp, what do we do?

An issue filed at https://github.com/NVIDIA/CUDALibrarySamples would be best (I personally would ask the reporter to submit that rather than filing on their behalf, but you're probably more patient than me 😄).

I think there should be tests that compare the compression/decompression

I'll add those to this PR.

TomAugspurger · 2025-07-28T12:38:18Z

I think there should be tests that compare the compression/decompression

https://github.com/zarr-developers/zarr-python/pull/2863/files#diff-b4b3e1bcdd519d839ba90f79f062ef0147362dbac854af26f871b53796791f21R69 has that.

I believe that all the comments have been addressed now but let me know if I missed anything.

I'm unsure why codecov is reporting that the coverage dropped here. Locally I see 100% coverage for codec/gpu.py and the lines changed in core/buffer/gpu.py changed in this PR are covered.

d-v-b · 2025-07-28T12:50:14Z

I confess that I'm worried about retracing the steps that led to a large number of poorly-maintained storage backends in zarr-python 2. I'm not saying that this functionality will be poorly maintained, but it is a risk, and I think we need to make a decision as a project about how we weigh that risk. For this particular feature, what would it look like if we spun it off into a separate repo under the zarr-developers org?

TomAugspurger · 2025-07-28T14:56:45Z

For this particular feature, what would it look like if we spun it off into a separate repo under the zarr-developers org?

Let's see:

For users, I think it could in principle be mostly seamless. We have entry points for loading a codec class without requiring users to import it. If this were in a separate package, I'd push for the gpu extra to depend on it, so that getting zarr-python with GPU support is still just pip install zarr[gpu] (this does induce a circular dependency, but it's in an extra so maybe that's not so bad).

We'd need some system to ensure that zarr.config.enable_gpu() has the ability to ask 3rd-party packages what changes to make. I don't think that system currently exists and so would need to be designed and implemented.

For maintainers, it'd be additional overhead from a separate project (CI, packaging, docs), which is minor but non-negligible.

My main question: is zarr-python really willing to commit to this as a stable interface, to the degree that third parties can build on it without worrying about things breaking? I worry that we don't know enough about the codec interface and its interaction with buffers yet to really commit to that (e.g. all your suggestions about buffers earlier in the thread).

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 25, 2025

TomAugspurger reviewed Feb 25, 2025

View reviewed changes

src/zarr/codecs/gpu.py Outdated Show resolved Hide resolved

akshaysubr added 2 commits February 25, 2025 21:30

First working version of Zstd codec on the GPU

49d5ee8

Adding nvcomp to the GPU dependency list

d548adc

akshaysubr force-pushed the gpu-codecs branch from d6608e7 to d548adc Compare February 25, 2025 21:30

Updating codec pipeline batch size for GPU codecs to enable parallelism

a8c0db3

madsbk mentioned this pull request Feb 28, 2025

Multi-GPU setup and python/zarr api rapidsai/kvikio#651

Closed

This was referenced Mar 3, 2025

Support zarr-python 3.x rapidsai/kvikio#646

Merged

Tracking upstream changes pangeo-data/ncar-hackathon-xarray-on-gpus#28

Open

weiji14 mentioned this pull request Mar 11, 2025

Kvikio backend entrypoint with Zarr v3 xarray-contrib/cupy-xarray#70

Draft

6 tasks

TomAugspurger reviewed Mar 18, 2025

View reviewed changes

src/zarr/codecs/gpu.py Show resolved Hide resolved

vyasr mentioned this pull request Apr 7, 2025

Evaluate proper home for Zarr bindings in kvikio rapidsai/kvikio#686

Open

weiji14 mentioned this pull request Apr 7, 2025

Request for packaging Python bindings conda-forge/nvcomp-feedstock#21

Closed

TomAugspurger reviewed Jun 9, 2025

View reviewed changes

src/zarr/codecs/gpu.py Show resolved Hide resolved

src/zarr/codecs/gpu.py Outdated Show resolved Hide resolved

src/zarr/codecs/gpu.py Show resolved Hide resolved

akshaysubr added 2 commits July 3, 2025 11:59

Making encode and decode async

69aa274

Removing custom awaitable in favor of event synchronize in an async t…

10e1bc9

…hread

TomAugspurger reviewed Jul 7, 2025

View reviewed changes

TomAugspurger added 2 commits July 9, 2025 12:28

Merge remote-tracking branch 'upstream/main' into gpu-codecs

771c0c1

Sync convert methods

ec07100

TomAugspurger added 2 commits July 9, 2025 12:36

test coverage

d1c37a3

loosen dtype restriction

69ea74e

TomAugspurger reviewed Jul 9, 2025

View reviewed changes

TomAugspurger marked this pull request as ready for review July 9, 2025 19:38

fixed Buffer.__add__

1b85fdc

TomAugspurger added 2 commits July 11, 2025 12:30

Added whatsnew

f5c7814

Merge remote-tracking branch 'upstream/main' into gpu-codecs

7671274

TomAugspurger added 5 commits July 18, 2025 10:03

Test coverage for uninitialized chunks

2282cb9

coverage

c6460b5

doc update

3b5e294

lint

dd825dc

@gpu_test

f89b232

TomAugspurger approved these changes Jul 18, 2025

View reviewed changes

TomAugspurger mentioned this pull request Jul 18, 2025

Support CUDA streams, devices #3271

Open

dstansby mentioned this pull request Jul 21, 2025

Add a policy on where to implement new codecs #3272

Open

dstansby requested changes Jul 21, 2025

View reviewed changes

TomAugspurger added 3 commits July 21, 2025 14:57

wip test stuff

7a4b037

doc updates

398b4d1

added failing compatibility test

dd69543

weiji14 mentioned this pull request Jul 24, 2025

Use nvcomp python bindings from conda-forge instead of PyPI pangeo-data/ncar-hackathon-xarray-on-gpus#36

Open

TomAugspurger added 6 commits July 24, 2025 07:37

added a matching test

76f7560

Some buffer coverage

8b5b3f1

coverage

7af3a16

update error message

996fbc0

private

d24d027

Merge remote-tracking branch 'upstream/main' into gpu-codecs

090349c

Uh oh!

Zstd Codec on the GPU #2863

Are you sure you want to change the base?

Zstd Codec on the GPU #2863

Uh oh!

Conversation

akshaysubr commented Feb 25, 2025

Uh oh!

Uh oh!

dstansby commented Feb 28, 2025

Uh oh!

akshaysubr commented Feb 28, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TomAugspurger left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TomAugspurger commented Jul 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

TomAugspurger left a comment

Choose a reason for hiding this comment

Uh oh!

dstansby left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TomAugspurger commented Jul 21, 2025

Uh oh!

d-v-b commented Jul 21, 2025

Uh oh!

TomAugspurger commented Jul 21, 2025

Uh oh!

d-v-b commented Jul 21, 2025

Uh oh!

TomAugspurger commented Jul 21, 2025

Uh oh!

dstansby commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

codecov bot commented Jul 9, 2025 •

edited

Loading

dstansby commented Jul 21, 2025 •

edited

Loading