Zstd Codec on the GPU #2863

Conversation
Force-pushed from d6608e7 to d548adc
Thanks for opening this PR! At the moment we do not have any codecs implemented in the …

@dstansby My understanding was that …
Looks nice overall. I think the async side of things ended up in a pretty good spot. The code itself is pretty easy to follow (a normal stream synchronize). Having to do that on another host thread is a bit unfortunate, but there's only one synchronize per batch so this should be fine.
I left comments on a few things to clean up that I can help out with if you want @akshaysubr.
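For reference, a rough sketch of that pattern (illustrative names, not the PR's exact code): the batch's kernels are enqueued on a stream, and the single blocking synchronize is pushed onto a worker thread so the event loop stays free.

```python
import asyncio

import cupy as cp

async def decode_batch(launch_decode_kernels) -> None:
    # enqueue all of the batch's decompression work on one stream
    stream = cp.cuda.Stream(non_blocking=True)
    with stream:
        launch_decode_kernels()
    # one blocking synchronize per batch, run off the event loop so other
    # coroutines can make progress while the GPU finishes
    await asyncio.to_thread(stream.synchronize)
```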
```python
checksum: bool = False

def __init__(self, *, level: int = 0, checksum: bool = False) -> None:
    # TODO: Set CUDA device appropriately here and also set CUDA stream
```
Agreed with leaving devices / streams as a TODO for now.
I want to enable users to overlap host-to-device memcpys with compute operations (like decode, but their own compute operations as well), but I'm not sure yet what that API will look like.
If you have any thoughts on how best to do this I'd love to hear them, and I'll write them up as an issue.
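To make the idea concrete, here is a hedged sketch of what overlapping could look like with plain CuPy streams and events (none of this is zarr API yet, and true overlap also requires pinned host memory):

```python
import cupy as cp
import numpy as np

h2d_stream = cp.cuda.Stream(non_blocking=True)
compute_stream = cp.cuda.Stream(non_blocking=True)

host_chunk = np.ones(1 << 20, dtype=np.uint8)  # would be pinned in practice

with h2d_stream:
    device_chunk = cp.asarray(host_chunk)  # copy enqueued on h2d_stream
copied = h2d_stream.record()               # event marking the copy's end

compute_stream.wait_event(copied)          # compute waits only on this copy
with compute_stream:
    result = device_chunk.sum()            # stands in for decode/user compute
```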
Opened #3271 for planning on devices and streams.
src/zarr/codecs/gpu.py
```python
    chunks_and_specs: Iterable[tuple[Buffer | None, ArraySpec]],
) -> Iterable[Buffer | None]:
    return [
        spec.prototype.buffer.from_array_like(cp.array(a, dtype=np.dtype("b"), copy=False))
```
This is one spot where @weiji14's idea to use dlpack in #2658 (comment) would help. If NDBuffer knew how to consume objects implementing the dlpack protocol, we could (maybe) get rid of the cp.array call.
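Something like this, say (a sketch of the idea, not the actual NDBuffer API; cp.from_dlpack is a real CuPy function):

```python
import cupy as cp

def from_array_like(obj):
    # If the object speaks DLPack, hand over its memory without a copy;
    # cp.from_dlpack is zero-copy for data already on the device.
    if hasattr(obj, "__dlpack__"):
        return cp.from_dlpack(obj)
    # Fall back to the current behavior for everything else.
    return cp.asarray(obj)
```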
Fixed some merge conflicts and changed the …
```diff
@@ -59,7 +59,7 @@ def __init__(self, array_like: ArrayLike) -> None:
     if array_like.ndim != 1:
         raise ValueError("array_like: only 1-dim allowed")
-    if array_like.dtype != np.dtype("B"):
+    if array_like.dtype.itemsize != 1:
```
The new tests in test_nvcomp.py were failing without this change.
I'd like to get us to a point where we don't care as much about the details of the buffer passed in here. This is an OK start I think.
What exactly does this check for? It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here.
And the same for the dimensionality check, since any N-dimensional numpy array can be viewed as a 1D array.
Yeah, I'm not really sure...
I agree that the actual data we store internally here needs to be a byte dtype. Just doing cp.asarray(input).view("b") seems pretty reasonable to me.
I'm not even convinced that we need Buffer / NDBuffer, when Buffer is just a special case of NDBuffer where there's 1 dimension and the data type is bytes.
We could even express this formally by:
- making NDBuffer generic with two type parameters (number of dimensions and dtype)
- having APIs that insist on consuming a Buffer instead insist on consuming NDBuffer[Literal[1], np.uint8]
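As a sketch (hypothetical, and Python type parameters are only checked statically, not at runtime):

```python
from typing import Generic, Literal, TypeVar

import numpy as np

NDim = TypeVar("NDim", bound=int)
DType = TypeVar("DType", bound=np.generic)

class NDBuffer(Generic[NDim, DType]):
    """Buffer parameterized by number of dimensions and dtype (sketch)."""

    def __init__(self, data: np.ndarray) -> None:
        self._data = data

# "Buffer" becomes a type alias for the 1-D byte special case:
Buffer = NDBuffer[Literal[1], np.uint8]

def encode(chunk: NDBuffer[Literal[1], np.uint8]) -> bytes:
    # APIs that used to require Buffer instead spell out the constraint
    return bytes(chunk._data)
```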
(super out of scope for this PR ofc)
> It's not clear to me why any numpy array that can be viewed as bytes wouldn't be allowed in here

I think this is mainly because NDBuffer objects don't need to be contiguous, but Buffer objects must be contiguous in memory, which might be important when we send those out to codecs that expect a contiguous memory slice.
But I agree that we can probably merge those two and make Buffer a specialization of NDBuffer.
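A small numpy illustration of that distinction (assumed semantics, not zarr code): a strided view is a perfectly good N-D array but can't be handed to a codec expecting one contiguous byte run without a copy.

```python
import numpy as np

a = np.arange(16, dtype="u2").reshape(4, 4)
strided = a[:, ::2]                  # NDBuffer-like: valid, but not contiguous
assert not strided.flags.c_contiguous

# Buffer-like consumers need a single contiguous run of bytes,
# so a copy is required before viewing the data as bytes:
flat = np.ascontiguousarray(strided).reshape(-1).view("B")
assert flat.flags.c_contiguous and flat.ndim == 1
```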
Codecov Report
❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2863      +/-   ##
==========================================
- Coverage   60.68%   60.68%   -0.01%
==========================================
  Files          78       79       +1
  Lines        9356     9424      +68
==========================================
+ Hits         5678     5719      +41
- Misses       3678     3705      +27
```
Thanks @akshaysubr, I think this is in a good state.
I'll open a follow-up issue to discuss how to handle devices and streams. I'll leave this open for a few days in case anyone is interested in taking a look, and we can hopefully merge it next week.
Thanks for the really nice work here! I'm afraid I'm going to mark this as request changes, because we still need to work out what our policy on including new codecs is - should they be included in numcodecs, or should they be included directly in zarr-python? There are a few reasons we need to work this out before adding new codecs:
- For developers, make it clear where to contribute new codecs (I know someone interested in implementing blosc2 at the moment, for example).
- For users, we need a clear story/documentation about where codecs live. If the answer is across both numcodecs and zarr-python, then we need clear cross-linking documentation to make sure this is clear.

I've opened an issue at #3272 to discuss this; once that's resolved we can either continue here or move over to numcodecs.
I left some comments inline - it might be worth waiting for the above to be resolved before looking at them?
Finally, this should also get some documentation in the user guide about how to swap between using the CPU and GPU when both are an option.
Thanks again for the work - it's super appreciated, and I'm sorry that merging is going to be a bit delayed until we work out what our long-term plan for implementing new codecs is.
""" | ||
return (ZstdCodec(),) | ||
return (cast(BytesBytesCodec, get_codec_class("zstd")()),) |
Why is the extra cast needed now?
get_codec_class returns type[Codec], but this function specifically returns a tuple[BytesBytesCodec].
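In other words (a minimal repro of the typing situation; the module paths are as I understand the current layout):

```python
from typing import cast

from zarr.abc.codec import BytesBytesCodec, Codec
from zarr.registry import get_codec_class

cls: type[Codec] = get_codec_class("zstd")  # registry lookup is typed loosely
# mypy can't prove the registered class is a BytesBytesCodec, so we assert it:
codec = cast(BytesBytesCodec, cls())
```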
Akshay responded on this point in February(!). @dstansby, could you at least respond to that directly: this is a wrapper around a codec engine, in exactly the same way that https://github.com/zarr-developers/zarr-python/blob/abbdbf2be70a24e7d662b5ed449c68f6718977f9/src/zarr/codecs/zstd.py is?
That's already included in …

Where is the source code for the nvcomp zstd implementation, and the python bindings?

I don't believe that the source code for those is published publicly.
That's potentially quite problematic. We recently had problems relating to skew across zstd implementations. If we cannot inspect the source code for this codec, and we cannot submit patches, then I would definitely not be interested in experiencing bugs from it.

Is that you speaking as a user of zarr or a maintainer? From the user perspective, this will only be used if you activate it. From the maintainer's perspective, I'd hope that through documentation and clear error messages we can pinpoint the issue for users.
Sorry, I missed that. And sorry for the slow reply - I am not paid to work on zarr-python, and review stuff on a best-efforts basis.

re. where stuff lives, I did not realise that we had codec classes implemented directly in …

Given that codec classes are in …

Am I right in thinking that GPU arrays and now decompression are a global option, not a per-array or a per-operation configuration? If so, it would be good to clarify in gpu.rst that a) this is only a global option, b) whether it can be changed during a Python session, and c) how decompression of GPU buffers is handled for codecs without a GPU implementation (I presume it just falls back to the CPU implementation).

re. @d-v-b's point about skew across zstd implementations, I think this is a very important point. I'm personally on the fence about depending on a closed-source library, but as a minimum I think there should be tests that compare the compression/decompression with the (open source, I presume?) CPU implementation and the new GPU implementation, and make sure the results are the same.
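Roughly something like this (a sketch only: gpu_codec stands in for the codec added here, and the real codec API is async and buffer-based, so the actual test would be adapted accordingly):

```python
import numpy as np
from numcodecs import Zstd

def test_zstd_cpu_gpu_parity(gpu_codec) -> None:
    raw = np.random.default_rng(0).integers(0, 256, size=1 << 16, dtype="u1").tobytes()
    cpu = Zstd(level=0)

    # GPU decode must round-trip data compressed by the CPU implementation...
    assert bytes(gpu_codec.decode(cpu.encode(raw))) == raw
    # ...and CPU decode must round-trip data compressed on the GPU.
    assert bytes(cpu.decode(gpu_codec.encode(raw))) == raw
```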
Both? Zarr users and maintainers routinely expose bugs in libraries we depend on, and we routinely forward issues to those libraries, and occasionally contribute patches. If users report a bug or problem in nvcomp, what do we do? Suppose you or Akshay leave NVIDIA, should I email support@nvidia.com when we have problems?
Likewise :/
I'm not sure... I think that https://zarr.readthedocs.io/en/stable/user-guide/config.html is the relevant documentation here:

This isn't adding any new concepts: just a new implementation that uses the existing configuration system (which I think is fair to summarize as a global option managed through our configuration system).
Sure... It's the same as any other configuration option. We don't explicitly show using any option in a context manager, but we do link to https://github.com/pytroll/donfig, which documents that behavior. I can duplicate that in our documentation if you want.
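For example (the class path here is a placeholder for whatever this PR's codec class ends up being called):

```python
import zarr

# override the implementation used for the "zstd" codec ID, scoped to a block
with zarr.config.set({"codecs.zstd": "zarr.codecs.gpu.NvcompZstdCodec"}):
    ...  # zstd chunks opened in here decode with the GPU implementation
```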
Yeah, it's following the documented behavior: the codec class associated with a given codec ID is used. We can repeat that in the GPU documentation and cross-reference the configuration docs. So from a generic codec side of things, it'll just use whatever codec implementation is associated with that codec name. From a GPU-specific side of things, we'll want to document which codecs are currently implemented (I imagine the remaining codecs will be a relatively straightforward refactor based on top of what this branch implements).
An issue filed at https://github.com/NVIDIA/CUDALibrarySamples would be best (I personally would ask the reporter to submit that rather than filing on their behalf, but you're probably more patient than me 😄).
I'll add those to this PR.

I believe that all the comments have been addressed now, but let me know if I missed anything. I'm unsure why codecov is reporting that the coverage dropped here. Locally I see 100% coverage for …
I confess that I'm worried about retracing the steps that led to a large number of poorly-maintained storage backends in zarr-python 2. I'm not saying that this functionality will be poorly maintained, but it is a risk, and I think we need to make a decision as a project about how we weigh that risk. For this particular feature, what would it look like if we spun it off into a separate repo under the zarr-developers org?
Let's see:

For users, I think it could in principle be mostly seamless. We have entry points for loading a codec class without requiring users to import it. If this were in a separate package, I'd push for the …

We'd need some system to ensure that …

For maintainers, it'd be additional overhead from a separate project (CI, packaging, docs), which is minor but non-negligible.

My main question: is zarr-python really willing to commit to this as a stable interface, to the degree that third parties can build on it without worrying about things breaking? I worry that we don't know enough about the codec interface and its interaction with buffers yet to really commit to that (e.g. all your suggestions about buffers earlier in the thread).
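For reference, the registration side could look roughly like this today (register_codec is the existing hook in zarr.registry; the package and class names below are made up):

```python
from zarr.registry import register_codec

from my_gpu_codecs import NvcompZstdCodec  # hypothetical external package

# an external package can point the "zstd" codec ID at its implementation,
# either at import time like this or via zarr's codec entry-point group
register_codec("zstd", NvcompZstdCodec)
```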
This PR adds a Zstd codec that runs on the GPU using the nvCOMP 4.2 Python APIs.
TODO:
- docs/user-guide/*.rst
- changes/