MAINT: simplify power fast path logic #27901
Conversation
Interesting, thanks for the start! I was worried that `int_arr**-1` may have changed, but it always errored in both versions :).
So this would certainly fix the issue, and it is a very nice amount of code deletion.
I think the main thing we need to do is ensure there is no (big) speed regression in any of the cases that are currently fast-pathed. (That is all the cases for floats, and `** 2` for integers, I think.)
We may already have benchmarks for these (not sure). If so, just running them and showing the result would be sufficient; otherwise, start by trying manually and then we can see whether we should add benchmarks.
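For context, the fast-pathed cases amount to special-casing a handful of exponent values before falling back to the generic power implementation. A minimal pure-Python sketch of that dispatch (the exact set of special-cased exponents is inferred from this thread, not taken from NumPy's C source):

```python
import math

def float_power_fast_path(base, exp):
    """Return (handled, result); handled=False means: use the generic pow.

    Sketch only -- NumPy's real fast paths live in templated C loops.
    """
    if exp == 1.0:
        return True, base              # identity fast path
    if exp == -1.0:
        return True, 1.0 / base        # reciprocal fast path
    if exp == 0.5:
        return True, math.sqrt(base)   # sqrt fast path
    if exp == 2.0:
        return True, base * base       # squaring fast path
    return False, None                 # fall through to generic pow

# Any fast path must agree exactly with the generic implementation:
assert float_power_fast_path(4.0, 0.5) == (True, 2.0)
assert float_power_fast_path(3.0, 2.0) == (True, 9.0)
assert float_power_fast_path(2.0, -1.0) == (True, 0.5)
assert float_power_fast_path(2.0, 3.0) == (False, None)
```

The benchmarking question above is exactly about whether removing branches like these costs measurable time on the hot loops.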
numpy/_core/src/umath/loops.c.src
Outdated
 }
-if (in1 == 1) {
+else if (in1 == 1) {
     *((@type@ *)op1) = 1;
I am a bit confused by the new (and existing) fast paths here. The first branch is the scalar case, and it would seem to me that any fast path is even more relevant there?
Should the fast path just be copied? Maybe there should be a second helper with the loop body, so that we can have `if (stride[1] == 0) { call_helper() } else { call_helper() }` to nudge the compiler into optimizing for the zero-stride case (and assume it will then hoist the checks out of the loop).
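The suggested restructuring can be sketched in pure Python (helper names hypothetical; the real code is templated C): dispatch once on whether the exponent operand has zero stride, i.e. is a scalar, so a compiler can specialize each instantiation of the body and hoist the exponent checks out of the hot loop.

```python
def _power_loop_body(bases, exps, out, scalar_exp):
    # In C this would be one helper called from two branches; a compiler
    # seeing a constant scalar_exp at each call site can lift the exponent
    # checks out of the loop for the zero-stride (scalar) instantiation.
    if scalar_exp:
        exp = exps[0]  # stride 0: every element reuses the same exponent
        for i, b in enumerate(bases):
            out[i] = b ** exp
    else:
        for i, (b, e) in enumerate(zip(bases, exps)):
            out[i] = b ** e

def power_loop(bases, exps, out, exp_stride):
    # The branch suggested in the review: one specialized call per case.
    if exp_stride == 0:
        _power_loop_body(bases, exps, out, scalar_exp=True)
    else:
        _power_loop_body(bases, exps, out, scalar_exp=False)

out = [0.0] * 3
power_loop([1.0, 2.0, 3.0], [2.0], out, exp_stride=0)
assert out == [1.0, 4.0, 9.0]
```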
To be clear, the `if (steps[1] == 0) {` path (sorry, I wrote `stride` above; the ufunc code here calls it "steps") is the path we need to worry about being fast.
I.e., if the second operand is a scalar, then we will always take that path (I am not 100% sure about `scalar**scalar`).
Yes, I stuck to the original fast-path logic, but I agree the scalar case is more important for efficiency. I've created a helper function to remove the repetition.
Thanks for the review :) We do seem to have benchmarks. This is the output without the changes in this PR:
Can we (I suppose we can) trust that the compiler is smart enough to lift that […]? The main difference should be that for […]. Not sure we should be concerned about it, but if anyone is, an exact fast path for common Python integers only to call […]. EDIT: Of course […]
Yes, a surface-level check in […]
Both float fast paths and the array power fast paths are complete! It seems a specific test under […]
The regression test seems to concern norms on object arrays. I have updated the test to consider a larger set of possible exceptions. Perhaps if someone was using […].
Aside from this, my PR should be ready! I will post the complete benchmarks soon.
My bad, just fixed an erroneously declared extra variable in the loops.
Added a check to improve speed when the exponent is a scalar. Here are the benchmarks with the changes:
And without:
I hope you don't mind if there is a bit more iteration. You were so active that I thought I should put in some feedback.
Overall, this looks good. We need to rethink the new fast paths once more, and I would like to double-check what exactly changed in those tests.
numpy/_core/src/multiarray/number.c
Outdated
{
    if (!PyArray_Check(o1)) {
        return -1;
    }
Do we really need this?
It seems removing it causes segmentation faults in the tests. Perhaps the compiler uses it to optimize?
Sorry, I was going to delete it and forgot. If (and only if) the later exact checks match, then the `PyArray_ISOBJECT` can be done safely (although some code paths used to call this with invalid inputs; I don't think they do anymore).
So you can probably remove it, but it would require moving the `PyArray_ISOBJECT` check to the end.
(If it still segfaults, some scalar path calls in here.)
Ah, so that it would be memory-safe in case of the nonzero returns! Thanks, this is done.
numpy/_core/src/multiarray/number.c
Outdated
PyArrayObject *a1 = (PyArrayObject *)o1;
if (PyArray_ISOBJECT(a1)) {
    return -1;
}
Nice! Yes, object is special and we cannot do this (unless we changed the object ufuncs, but I'm not sure that would be right for `**0.5` at least).
One nitpick, though: please don't use `-1` for "not taken". `-1` typically means an error occurred, so I think it is confusing.
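The suggestion can be sketched as distinct return sentinels (names hypothetical, not from the PR), so "fast path not taken" is never conflated with "error occurred", which `-1` conventionally signals in CPython/NumPy C code:

```python
FAST_PATH_TAKEN = 0
FAST_PATH_NOT_TAKEN = 1   # caller falls through to the generic ufunc path
FAST_PATH_ERROR = -1      # a real error: an exception has been set

def try_power_fast_path(is_object_array):
    """Toy dispatcher illustrating the return convention only."""
    if is_object_array:
        # Object arrays must take the generic path, but that is not
        # an error condition, so it must not look like one.
        return FAST_PATH_NOT_TAKEN
    return FAST_PATH_TAKEN

assert try_power_fast_path(True) == FAST_PATH_NOT_TAKEN
assert try_power_fast_path(False) == FAST_PATH_TAKEN
```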
Makes sense! Fixed this.
assert_raises(ValueError, linalg.norm, testvector, ord=-2)
for ord in ['fro', 'nuc', np.inf, -np.inf, 0, -1, -2]:
    pytest.raises(
        (ValueError, ZeroDivisionError, RuntimeWarning),
Can you maybe give an example of the core operation that changed (i.e. what kind of array raised to the power of what kind of scalar this ends up at)?
I would generally not worry about the type of error, but the `RuntimeWarning` is only an error in CI; maybe this gives a warning and then later raises, or maybe not.
I agree those changes are probably fine, but I would feel better seeing an example (I can dig that up myself, but maybe you have it).
It seems that the `norm` function does some casts on the object array:
if not issubclass(x.dtype.type, (inexact, object_)):
    x = x.astype(float)
As far as I can tell, the regression test checks whether this cast works, which is probably why there are cases checking that the arrays are equal and the dtype is `float64`.
With the fast paths, we call `reciprocal` directly, which raises a `ZeroDivisionError` or a `RuntimeWarning` (depending on the environment) instead of a `ValueError`.
I can provide a concrete traceback soon.
Can you provide that traceback/details (I assume it's quick for you)? The `RuntimeWarning` case is the interesting one: if we ignore the `RuntimeWarning`, will we hit a `ValueError` later?
If we don't hit any error on that path, then we need to think about it more carefully.
(And also adjust the test, if only to add `with np.errstate(something="raise"):`.)
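On the errstate point: `np.errstate` is NumPy-specific, but the general pattern at issue — an operation that only warns by default and raises only when the warning is escalated, which CI does — can be sketched with the stdlib `warnings` module (`reciprocal_ish` is a made-up stand-in, not NumPy's `reciprocal`):

```python
import warnings

def reciprocal_ish(x):
    # Mimics IEEE-style behavior: warn on divide-by-zero, return inf.
    if x == 0:
        warnings.warn("divide by zero encountered", RuntimeWarning)
        return float("inf")
    return 1.0 / x

# By default the warning does not interrupt execution:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert reciprocal_ish(0.0) == float("inf")
    assert caught and issubclass(caught[0].category, RuntimeWarning)

# Escalated to an error (as CI does), the same call raises, which is
# what the regression test needs to pin down explicitly:
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        reciprocal_ish(0.0)
    except RuntimeWarning:
        pass
    else:
        raise AssertionError("expected RuntimeWarning to be raised")
```

This is why a bare `RuntimeWarning` in the expected-exception tuple is suspect: outside CI it may never raise at all.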
else if (in2 == 0.5) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = sqrt(in1);
This would need to use a type-specific `sqrt` function. More importantly, though, we won't get all the optimized versions here.
You should probably call `@TYPE@_sqrt` (which includes the loop), but I think that might not dispatch right. @seiko2plus can probably say instantly what to put here to call `sqrt` directly.
Makes sense! I'll try to get `@TYPE@_sqrt` to dispatch right.
It seems that a C type-specific `sqrt` function is assigned on line 229:
* #sqrt = sqrt, sqrtf#
I suppose it is still better for consistency to use the ufunc implementation.
Looking it up, it seems `@TYPE@_sqrt` is indeed difficult to dispatch right without allocating a new array, so we would indeed need another call.
Given that type-specific functions are defined, though, I'm wondering if just pushing all of these conditions into the `BINARY_LOOP_SLIDING` loop and seeing whether the compiler optimizes might be worth the tradeoff?
I just used `@sqrt@` and restructured the loop to include the conditions. It doesn't seem to have hurt performance, and it looks cleaner.
    }
    return;
}
else if (in2 == 2.0) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = in1 * in1;
In fact, here we should maybe also just call the normal multiply loop (with duplicated `op1` pointer).
Will do, along with the sqrt!
numpy/_core/src/umath/loops.c.src
Outdated
BINARY_LOOP_SLIDING {
    @type@ in1 = *(@type@ *)ip1;
    if (_@TYPE@_power_fast_path_helper(in1, in2, (@type@ *)op1) != 0) {
        *((@type@ *)op1) = _@TYPE@_squared_exponentiation_helper(in1, in2start, first_bit);
Hmmm, this can never be taken if we have the outer if, right? Maybe put an `assert(0); /* unreachable */` there to clarify?
(Maybe it was nicer without the outer check, and the compiler would do it anyway, but this is good.)
Actually, I think it can, because of the `in1 != 0` check. The memory-overlap symmetry tests fail when I do not check whether `in1 == 0` before taking the fast paths. It seems zeros play a special role in memory overlap? (We would not need the outer check for efficiency if not for this case.)
Edit: to clarify, the outer check is there for efficiency, as the compiler cannot be relied upon to simplify things with the `in2 != 0` check written the way it is (though perhaps there is a way to rewrite this to improve a lot of it; investigating that).
Okay, so it seems some of the boolean logic was the issue for the symmetry checks. Unfortunately, it doesn't seem like the compiler optimizes either way inside the loop, so I created a boolean variable to track whether any fast-path check has failed. It looks a bit confusing, though it improved the benchmarks.
Update: it seems the compiler can optimize if the check boolean is declared outside of the loop, and with the boolean logic in the helper fixed, we now have a better-looking solution.
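For reference, the `_@TYPE@_squared_exponentiation_helper` in the diff appears to implement exponentiation by squaring (the C version works on fixed-width integers and takes precomputed bit state via `in2start`/`first_bit`, which this generic pure-Python sketch omits):

```python
def pow_by_squaring(base, exp):
    """Compute base**exp for a non-negative integer exp in O(log exp) steps."""
    assert exp >= 0
    result = 1
    while exp:
        if exp & 1:        # lowest exponent bit set: fold in current square
            result *= base
        base *= base        # square the base for the next bit
        exp >>= 1           # move to the next bit of the exponent
    return result

assert pow_by_squaring(3, 0) == 1
assert pow_by_squaring(3, 5) == 243
assert pow_by_squaring(2, 10) == 1024
```

The fast-path helper discussed above then only needs to intercept the trivial exponents before this loop runs.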
numpy/_core/src/multiarray/number.c
Outdated
if (PyLong_CheckExact(o2)) {
    long exp = PyLong_AsLong(o2);
    if (error_converting(exp)) {
        PyErr_Clear();
I think this is fine and I'm happy to keep it as is. In general, a blanket `PyErr_Clear()` has a slight "code smell" (we still have plenty of them in NumPy, though).
This one could be removed with `PyLong_AsLongAndOverflow`, because if an error occurs with that, I think we can assume it is a critical error (e.g. `KeyboardInterrupt`).
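`PyLong_AsLongAndOverflow` reports out-of-range values through a separate overflow flag instead of raising `OverflowError`, so no `PyErr_Clear()` is needed for that case. A rough Python model of that contract (simplified: the real C function returns -1 and sets `*overflow`; the constants here assume a 64-bit `long`):

```python
C_LONG_MAX = 2**63 - 1   # assumption: 64-bit C long
C_LONG_MIN = -(2**63)

def as_long_with_overflow(obj):
    """Return (value, overflow): overflow is +1/-1 when out of range, else 0."""
    if not isinstance(obj, int):
        raise TypeError("expected an int")   # a "real" error, not overflow
    if obj > C_LONG_MAX:
        return 0, 1
    if obj < C_LONG_MIN:
        return 0, -1
    return obj, 0

# In-range values convert; out-of-range values signal overflow without
# setting an exception, so the caller can silently take the generic path:
assert as_long_with_overflow(5) == (5, 0)
assert as_long_with_overflow(2**80) == (0, 1)
assert as_long_with_overflow(-(2**80)) == (0, -1)
```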
Makes sense! Made this change.
No, I don't mind at all! I appreciate the feedback. I am rethinking some of the logic I built to pass the symmetry checks; perhaps that will help clean up the code further.
Made several changes, including adding the reciprocal fast path for floats, though it probably needs to be optimized. Thanks for reviewing.
After running the benchmarks, it seems that the performance decreases could indeed be flukes; on the whole, checking for Python scalars might actually have reduced performance by a slight amount. I removed those checks to reduce the amount of code; if we would still like to have them, I'm happy to add them back. I couldn't find an efficient way to dispatch […].
Here are the compared benchmarks (run with […]):
This should be ready for review! Thank you.
Edit: there seems to be a test failure, with an absolute difference of ~2e-7, in one specific environment. It seems to do with the […]
Nice. I am happy to not have any fast paths. The only worry I have is the fast paths for float arrays getting slower.
Can you check (even just with `%timeit`) the speed of `large_float_arr**0.5` and `large_float_arr**2`?
I suspect we don't have any benchmarks for the operators here (you could add them, of course).
@seiko2plus, if you could give a hand and see whether we can just forward the call to the actual SIMD implementations easily, that would be nice. (It's a bit annoying, because we would have to build new strides/pointers, though...)
So, while I would love to just get rid of the fast paths, I do suspect we need them at least for floats to avoid a regression. (The ufuncs can do it now in theory, but I think it is too much hassle to do in this PR unless we can do it hassle-free enough from inside the inner loop.)
Thanks! That makes sense. Added the benchmarks. Yes, the float operators did get slower:
I've added the fast paths back:
The […]
If that looks fine, I could look into adjusting the test, to ensure the warning does trigger one of the errors!
What happens in the polynomials in the end is fine, maybe. But the underlying issue is not fine. If you look into it, e.g. with `pdb`, […] but that piece of code must raise: […]
Not sure why. Did you add a fast path for reciprocals which are not defined?
Yes, you're right, sorry for the oversight. It seems to be at the scalar level, where I did not disallow the fast paths for integers. I'll fix that.
I have restricted the scalar fast paths to float and complex arrays. The regression test is now reverted to its original state. I'm doing more testing to ensure everything works as expected.
Thanks @MaanasArora, I think we can basically put this in. Although I would like @seiko2plus (or @r-devulap) to have a look at the power changes, because I suspect we can do better, I'll just put it in soon anyway.
The change might even make things worse, but only on `avx512-skx`; otherwise the fast path seems fine to me (whether it changes the results by a tiny bit or not).
OK, let's give this a shot, thanks. It still might be nice to think about how to fall back nicely to the specialized function for the fast paths.
Thank you! Happy to continue helping with this!
This is an initial draft to resolve #27082.
I have removed the fast paths from `array_power` and plan to implement them in the individual ufunc templates, which seem to be few. This prevents the need for scalar extraction and reduces divergence from the ufunc machinery. Currently I have only implemented this for the integer loops, but I would appreciate feedback before proceeding.
I hope I understood the issue correctly. Thank you!