ENH: stats.pearsonr: add array API support #20284
Conversation
This sounds right to me. As long as we avoid converting back-and-forth multiple times, one pair of conversions is necessary (unless someone decides to write the special functions in pure Python 😅 (or the special extension gets into the standard...) and the distn. infra gets support).

Yes, in the near term, the new distribution infrastructure will be able to evaluate the special functions in an array API compatible way, and I will give the resampling methods array API support soon, too. So this would just be temporary, probably for one release only, if that. Further out, yeah, the special function array API extension (data-apis/array-api#725) would speed things up considerably. (Oops, looks like you mentioned it in an update, but maybe good to have the link here for others.) So in the meantime, is there a canonical way to do the conversion?

Just use
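For context, here is a minimal sketch of the kind of one-way conversion being discussed. The helper name and structure are hypothetical (not SciPy's actual code): compute with the input's own namespace, hop to NumPy once for the NumPy-only special function, and convert the result back once.

```python
import numpy as np

def apply_numpy_special(func, x, xp):
    # Hypothetical helper: `func` is a NumPy-compatible special function
    # (e.g. from scipy.special); `xp` is the input's array namespace.
    # np.asarray handles any CPU array exposing the buffer/DLPack
    # protocols; GPU arrays generally need an explicit device transfer.
    x_np = np.asarray(x)
    out = func(x_np)        # evaluate with the NumPy-only implementation
    return xp.asarray(out)  # convert back to the original namespace
```

With `xp` set to NumPy itself this is a no-op round trip; with another backend it is exactly one conversion each way.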
Force-pushed from c5d60dd to 675dde6.
looking pretty good!
Force-pushed from 02242a4 to 6ddde92.
do you want to add the CI check in this PR?
```python
res_ci = res.confidence_interval()
ref_ci = ref.confidence_interval()
xp_assert_close(res_ci.low, xp.asarray(ref_ci.low))
xp_assert_close(res_ci.high, xp.asarray(ref_ci.high))
```
Could use input validation tests. Maybe we should also check edge cases like length-2 input, constant input, etc. with the array API. But I imagine that someday we'll just want the option of running all tests with non-NumPy arrays, no? I don't think we should duplicate existing tests for the array API, right? Should I convert all the tests now?
Yeah, better to convert existing tests where possible, perhaps splitting into two and running one with np_only.
e.g. scipy/scipy/fft/tests/test_basic.py, lines 332 to 368 at 235602c:
```python
@skip_if_array_api(np_only=True)
@pytest.mark.parametrize("dtype", [np.float16, np.longdouble])
def test_dtypes_nonstandard(self, dtype):
    x = random(30).astype(dtype)
    out_dtypes = {np.float16: np.complex64, np.longdouble: np.clongdouble}
    x_complex = x.astype(out_dtypes[dtype])

    res_fft = fft.ifft(fft.fft(x))
    res_rfft = fft.irfft(fft.rfft(x))
    res_hfft = fft.hfft(fft.ihfft(x), x.shape[0])
    # Check both numerical results and exact dtype matches
    assert_array_almost_equal(res_fft, x_complex)
    assert_array_almost_equal(res_rfft, x)
    assert_array_almost_equal(res_hfft, x)
    assert res_fft.dtype == x_complex.dtype
    assert res_rfft.dtype == np.result_type(np.float32, x.dtype)
    assert res_hfft.dtype == np.result_type(np.float32, x.dtype)

@pytest.mark.parametrize("dtype", ["float32", "float64"])
def test_dtypes_real(self, dtype, xp):
    x = xp.asarray(random(30), dtype=getattr(xp, dtype))

    res_rfft = fft.irfft(fft.rfft(x))
    res_hfft = fft.hfft(fft.ihfft(x), x.shape[0])
    # Check both numerical results and exact dtype matches
    rtol = {"float32": 1.2e-4, "float64": 1e-8}[dtype]
    xp_assert_close(res_rfft, x, rtol=rtol, atol=0)
    xp_assert_close(res_hfft, x, rtol=rtol, atol=0)

@pytest.mark.parametrize("dtype", ["complex64", "complex128"])
def test_dtypes_complex(self, dtype, xp):
    x = xp.asarray(random(30), dtype=getattr(xp, dtype))

    res_fft = fft.ifft(fft.fft(x))
    # Check both numerical results and exact dtype matches
    rtol = {"complex64": 1.2e-4, "complex128": 1e-8}[dtype]
    xp_assert_close(res_fft, x, rtol=rtol, atol=0)
```
Should I convert all the tests now?
If you want - it should be done eventually at least
Done!
Do we still need this separate test after converting the others?
I thought about that. I'll think about it some more before my next push.
For old functions like this, without really thinking through the existing test suite from the bottom up, it's hard to be sure that the tests are sufficient. So I think it's valuable to have a property-based test like this. To be more thorough, I might try using hypothesis.
Oops, missed that.
at the end of

Yeah, looks correct 👍
I had to create my own. Regarding:

```python
if res.ndim == 0:
    return res.item()
else:
    return res
```

Using
@rgommers has this point about returning scalars come up previously?

I don't think so. This pattern is pretty specific to SciPy.

Agreed, that is better.

That looks okay to me; I don't think it'll get clearer/shorter than that.
I know this isn't finished yet, but the array API changes look good to me (I do need to check CUDA though). If a stats maintainer could give a 👍 once it is ready, that would be good.
@asmeurer out of interest, do you have a rough ETA for 2023.12 support in array-api-strict?
scipy/stats/_stats_py.py (outdated):

```python
def _move_axis_to_end(x, source, xp):
    axes = list(range(x.ndim))
    temp = axes.pop(source)
```
Actually, I do think I'd like to consider this done, unless there was something besides the array API test? After I do a few of these, I'll come back and replace the separate array API test with a more thorough test with hypothesis.
@tupui Would you be interested in checking that I didn't change anything from a stats perspective?

I'm having lots of problems with trying to get CUDA working again after changing some drivers while working on JAX. I'll try to get it resolved, but someone else may have to check GPU.
Okay, I got PyTorch CUDA working, quite a few failures, see below @mdhaber. CuPy will have to wait (cupy/cupy#8260 ...) but that's alright.

(traceback collapsed)
Didn't we expect this given use of

Yep, I think it just means that we need to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an env with CuPy,

```
python dev.py test -t scipy.stats.tests.test_stats -b all
```

passes for me.

LGTM once there is approval from the stats side!
I've not been following these closely, but have we changed our policy to allow array API support to be added partially to modules now?

Partial support is currently released in
The stats part is fine by me, and Lucas had a look at the array API side, so good to go. We could have moved the utils to the global array API utils that we have, though I don't think we will really "forget", so it's okay like this.
Upstream issue: data-apis/array-api-strict#25
Reference issue
gh-20137
Closes gh-20324
What does this implement/fix?

This explores the addition of Array API support to scipy.stats.pearsonr. Only the last commit is relevant; the others are from gh-20137.

Additional information
Need to resolve merge conflicts and skip a test on 32-bit, but otherwise, I think this is ready to go.
Old news:
Most of this will be pretty straightforward. There are some little things I'll want to address later (e.g. previously, pearsonr converted inputs to be at least float64, but I imagine we'd want to respect dtype with array API), but there is one big question for now:

Calculation of the p-value currently relies on the incomplete beta function, which is not among the special functions for which we have experimental array API support (gh-19023). Even if it were, calculation of the p-value currently relies on the distribution infrastructure. In any case, calculation of the statistic with an alternative array backend can easily be done now, but calculation of the p-value with an alternative array backend will take more time.
For vectorized calculations, I think there is still value in calculating the statistic with the alternative array backend, then converting the statistic (which has been reduced by at least one dimension) to a NumPy array for calculation of the p-value. Are there objections to this?
I know there have been objections to converting non-NumPy arrays to NumPy arrays for running compiled code, but I think this is a little different, since the statistic might not even be an array after the reducing operation, and if it is, it's smaller than the original array.