Add async oindex and vindex methods to AsyncArray #3083
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3083 +/- ##
==========================================
+ Coverage 60.68% 60.73% +0.04%
==========================================
Files 78 78
Lines 9356 9407 +51
==========================================
+ Hits 5678 5713 +35
- Misses 3678 3694 +16
@dcherian suggested making the sync oindex and vindex getitem methods call the new async versions. EDIT: I think this is already the case?
@property
def oindex(self) -> AsyncOIndex[T_ArrayMetadata]:
I chose this API to try to follow this pattern:
Array.__getitem__ (exists)
Array.oindex.__getitem__ (exists)
Array.vindex.__getitem__ (exists)
AsyncArray.getitem (exists)
AsyncArray.oindex.getitem (new)
AsyncArray.vindex.getitem (new)
because Python doesn't let you make an async version of the __getitem__ magic method.
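A toy sketch of the naming pattern above, not zarr's actual code: every class here (ToyArray, ToyAsyncArray, ToyOIndex, ToyAsyncOIndex) is a hypothetical stand-in, and asyncio.run stands in for zarr's own sync helper. The sync side exposes square-bracket indexing through a helper object, while the async side exposes an awaitable getitem method on a parallel helper.

```python
import asyncio


class ToyAsyncOIndex:
    """Hypothetical async helper: exposes `getitem` as a coroutine method."""

    def __init__(self, array):
        self.array = array

    async def getitem(self, selection):
        return await self.array.get_orthogonal_selection(selection)


class ToyAsyncArray:
    """Hypothetical async array backed by a nested list."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    @property
    def oindex(self):
        return ToyAsyncOIndex(self)

    async def get_orthogonal_selection(self, selection):
        row_idx, col_idx = selection
        await asyncio.sleep(0)  # stand-in for awaiting chunk I/O
        # Orthogonal (outer) indexing: every selected row with every selected column.
        return [[self._rows[r][c] for c in col_idx] for r in row_idx]


class ToyOIndex:
    """Hypothetical sync helper: square brackets delegate to the async helper."""

    def __init__(self, array):
        self.array = array

    def __getitem__(self, selection):
        # asyncio.run is an assumption standing in for a blocking `sync` helper.
        return asyncio.run(self.array._async_array.oindex.getitem(selection))


class ToyArray:
    def __init__(self, rows):
        self._async_array = ToyAsyncArray(rows)

    @property
    def oindex(self):
        return ToyOIndex(self)


arr = ToyArray([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
sync_result = arr.oindex[[0, 2], [1, 2]]
async_result = asyncio.run(arr._async_array.oindex.getitem(([0, 2], [1, 2])))
```

Both calls select rows 0 and 2 crossed with columns 1 and 2, so the sync and async entry points return the same nested list.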
Can we update the sync versions to use sync(self._async_array.oindex.getitem) instead of going to _get_selection directly? That will get us some test coverage.
That sounds very sensible, but when I try to refactor to do that I find that it causes a problem where sync() gets called from within an async coroutine. I wouldn't have thought this should happen, but here's Claude's explanation of why it does:
> Explain how exactly sync is being called from within an async coroutine
⏺ Looking at the error and the code, here's exactly how sync() is being called from within an async coroutine:
The Call Chain
1. Test calls zarr array indexing: za[zix] in test_indexing_with_zarr_array
2. Array.__getitem__ (src/zarr/core/array.py:2566):
   return self.vindex[cast("CoordinateSelection | MaskSelection", selection)]
3. VIndex.__getitem__ (src/zarr/core/indexing.py:1248):
   return sync(self.array._async_array.vindex.getitem(selection))
4. sync() tries to run the coroutine but detects it's already in a running loop
The Problem
The issue is in the __array__ method call chain:
1. VIndex.__getitem__ calls sync() with self.array._async_array.vindex.getitem(selection)
2. This eventually leads to AsyncArray._get_selection() being called
3. But somewhere in the process, Array.__array__() gets called (src/zarr/core/array.py:2413)
4. __array__() calls self[...] which goes back to Array.__getitem__
5. This creates a nested call where sync() is called while already inside an async context
The Root Cause
The function _zarr_array_to_int_or_bool_array() at line 85 calls np.asarray(arr), which triggers the __array__ protocol on the zarr array. This causes:
def _zarr_array_to_int_or_bool_array(arr: Array) -> npt.NDArray[np.intp] | npt.NDArray[np.bool_]:
    if arr.dtype.kind in ("i", "b"):
        return np.asarray(arr)  # <-- This calls arr.__array__()
When np.asarray() is called on a zarr Array, it calls Array.__array__(), which calls self[...], which eventually calls sync() again - but we're already inside a sync() call from the VIndex, creating the nested async context error.
The original code before the changes avoided this by handling the zarr array conversion within the sync methods directly, rather than delegating to async methods that would create this nested sync situation.
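The failure mode described above can be reproduced in miniature. This is a simplified sketch: the sync helper here is a hypothetical stand-in built on asyncio.run, not zarr's actual implementation, and fetch and fetch_with_nested_sync are illustrative names. It only shows that a blocking "run this coroutine to completion" call raises when it is reached from inside a coroutine that is already running.

```python
import asyncio


def sync(coro):
    """Hypothetical blocking helper: run a coroutine to completion."""
    try:
        return asyncio.run(coro)
    except RuntimeError:
        # Close the coroutine so an un-run one does not emit a "never awaited" warning.
        coro.close()
        raise


async def fetch(selection):
    # Stand-in for an async read.
    return [10 * i for i in selection]


async def fetch_with_nested_sync(selection):
    # A blocking call reached from inside a running coroutine,
    # analogous to np.asarray(arr) re-entering the sync API.
    return sync(fetch(selection))


def nested_sync_fails() -> bool:
    try:
        sync(fetch_with_nested_sync([0, 1]))
    except RuntimeError:
        return True
    return False


plain_result = sync(fetch([1, 2]))
nested_failed = nested_sync_fails()
```

Calling sync from ordinary synchronous code works, while the nested call fails because asyncio.run refuses to start while an event loop is already running in the same thread.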
I guess for indexing with a Zarr array, we should convert to numpy array before the sync call
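One way to picture that suggestion, as a sketch with hypothetical toy classes (ToyArray, ToyAsyncArray) and asyncio.run standing in for the sync helper: the array-valued selection is materialised in synchronous code first, so the coroutine only ever sees plain integers and never needs to re-enter the blocking API.

```python
import asyncio


def sync(coro):
    # Hypothetical blocking helper; an assumption, not zarr's implementation.
    return asyncio.run(coro)


class ToyAsyncArray:
    def __init__(self, values):
        self._values = list(values)

    async def getitem(self, indices):
        # Expects plain integers only; no blocking calls happen in here.
        return [self._values[i] for i in indices]


class ToyArray:
    def __init__(self, values):
        self._async_array = ToyAsyncArray(values)

    def to_list(self):
        n = len(self._async_array._values)
        return sync(self._async_array.getitem(range(n)))

    def __getitem__(self, selection):
        # Convert an array-valued selection BEFORE entering sync(),
        # while no event loop is running in this thread.
        if isinstance(selection, ToyArray):
            selection = selection.to_list()
        return sync(self._async_array.getitem(selection))


data = ToyArray([10, 20, 30, 40])
index = ToyArray([3, 0])
result = data[index]
```

Here the two sync calls run one after the other rather than nested, which is the property the conversion step is meant to guarantee.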
Shall we also have the sync array getitem methods use these async methods? zarr-python/src/zarr/core/array.py Lines 2785 to 2792 in 6fa9f37
Yea I would like to, but don't fully understand how to get that to work. So I thought I could leave that for a follow-up.
OK but presumably that error means async indexing with Zarr arrays also doesn't work (https://github.com/zarr-developers/zarr-python/pull/3083/files#r2231114456). Can you open an issue to track please?
I didn't even know it was possible to index a zarr array with another zarr array!
Actually I just added a test that seems to show that indexing a zarr array with a (sync) zarr array does work. I also tried indexing a zarr array with an async zarr array, which fails:
# try indexing with async zarr array
> result = await async_zarr.oindex.getitem(z2._async_array)
tests/test_indexing.py:2061:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/zarr/core/indexing.py:974: in getitem
return await self.array.get_orthogonal_selection(
src/zarr/core/array.py:1440: in get_orthogonal_selection
indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid)
src/zarr/core/indexing.py:878: in __init__
dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_chunk_len)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <[AttributeError("'BoolArrayDimIndexer' object has no attribute 'dim_sel'") raised in repr()] BoolArrayDimIndexer object at 0x107e8e990>
dim_sel = <AsyncArray memory://4427184448/z2 shape=(2,) dtype=bool>, dim_len = 2, dim_chunk_len = 1
    def __init__(self, dim_sel: npt.NDArray[np.bool_], dim_len: int, dim_chunk_len: int) -> None:
        # check number of dimensions
        if not is_bool_array(dim_sel, 1):
            raise IndexError("Boolean arrays in an orthogonal selection must be 1-dimensional only")
        # check shape
        if dim_sel.shape[0] != dim_len:
            raise IndexError(
                f"Boolean array has the wrong length for dimension; expected {dim_len}, got {dim_sel.shape[0]}"
            )
        # precompute number of selected items for each chunk
        nchunks = ceildiv(dim_len, dim_chunk_len)
        chunk_nitems = np.zeros(nchunks, dtype="i8")
        for dim_chunk_ix in range(nchunks):
            dim_offset = dim_chunk_ix * dim_chunk_len
            chunk_nitems[dim_chunk_ix] = np.count_nonzero(
>               dim_sel[dim_offset : dim_offset + dim_chunk_len]
            )
E       TypeError: 'AsyncArray' object is not subscriptable
src/zarr/core/indexing.py:613: TypeError
If that is supposed to work I can raise an issue for it, but it doesn't seem to be the same
The Claude diagnosis points to a sync
Not sure what to do about codecov, except add more tests
Thanks for the extra tests!
…3311) Co-authored-by: Tom Nicholas <tom@earthmover.io>
Array has .oindex and .vindex methods, but AsyncArray has no equivalent. This PR adds them. It only adds the get methods, not the set methods, which I thought could be deferred to a follow-up PR.
I want it for pydata/xarray#10327 (comment)
TODO:
docs/user-guide/*.rst
changes/