ENH: AVX512 SIMD optimizations for float power fast paths #28248
Conversation
Update: made a few "simplifying" changes that removed some functionality to fix some bugs. Will try to add it back again more carefully.
The function already has an `#ifdef AVX512` block. Can we simplify this patch by moving the SIMD code into that portion? I suggest we add the special case to the `simd_pow_f32/f64` function defined above.
@r-devulap Sure! Maybe we should create a new helper function though, because putting all the zero-stride logic into `simd_pow_f32/f64` would complicate it.
I've pushed a simplification using a helper function. Hope that suffices!
The current patch has a bug. Please add a test exercising the new code paths; it looks like we do not have coverage.
```c
const npy_intp ssrc2 = steps[1] / sizeof(@type@);
const npy_intp sdst = steps[2] / sizeof(@type@);

if (stride_zero) {
```
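For context on what the `stride_zero` condition corresponds to at the Python level: a scalar exponent broadcast against an array is presented to the inner loop as a zero-stride operand. A sketch, using `broadcast_to` to make the zero stride explicit (the specific exponents 2.0 and 0.5 are the fast-path candidates discussed in this PR; bit-exact agreement is not guaranteed, only closeness):

```python
import numpy as np

a = np.linspace(0.5, 4.0, 8, dtype=np.float32)

# A scalar exponent broadcast against the array is seen by the ufunc
# inner loop as a zero-stride operand; broadcast_to makes that explicit:
b = np.broadcast_to(np.float32(2.0), a.shape)
assert b.strides == (0,)

# These are the identities a zero-stride fast path can target
# (to floating-point tolerance; bit-exact equality is not guaranteed):
np.testing.assert_allclose(np.power(a, 2.0), np.square(a), rtol=1e-6)
np.testing.assert_allclose(np.power(a, 0.5), np.sqrt(a), rtol=1e-6)
print("ok")
```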
The if condition here isn't ideal: it slows down the function even when it doesn't hit the special paths. After you fix the loop, could you provide benchmark numbers on AVX-512?
Sure, will share. I don't have AVX-512 on my laptop, but I can test using Intel's simulator (which the NumPy tests also use), if that would work for performance too.
That is only useful for testing, not benchmarking. Benchmark numbers won't be reliable under an emulator.
Right, do you have any recommendations for how we can proceed in that case, without supported hardware for benchmarking? Sorry about this.
No worries. I benchmarked this on a TGL:
| Change | Before [79f44308] <main> | After [9a0fa57f] <pow_fast_paths> | Ratio | Benchmark (Parameter) |
|----------|----------------------------|-------------------------------------|---------|---------------------------------------------------------------------|
| + | 234±2μs | 466±10μs | 1.99 | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float32'>) |
| - | 1.89±0.02ms | 1.76±0.01ms | 0.94 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
| - | 835±20μs | 745±9μs | 0.89 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>) |
| - | 2.30±0.2ms | 1.77±0.01ms | 0.77 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| - | 2.18±0ms | 1.10±0ms | 0.5 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>) |
| - | 628±1μs | 249±2μs | 0.4 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>) |
| - | 1.10±0.01ms | 286±1μs | 0.26 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>) |
The 2x regression on `arr ** 2` is a bit odd. Any idea why that can happen?
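For anyone without an asv setup, a rough way to reproduce just this comparison is a `timeit` microbenchmark (this is only an approximation of the `BinaryBench` benchmarks in the table above, not a substitute for asv numbers):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(1, 4, 100_000).astype(np.float32)

# Time the operator path (integer scalar exponent) against the
# explicit float-exponent ufunc call:
t_op = timeit.timeit(lambda: a ** 2, number=200)
t_pow = timeit.timeit(lambda: np.power(a, 2.0), number=200)
print(f"a ** 2          : {t_op:.4f} s")
print(f"np.power(a, 2.0): {t_pow:.4f} s")
```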
Thanks! I suspect it's related to the fact that `2` is an integer scalar, so it is probably taking the `fast_scalar_power` function rather than promoting the types and hitting the ufunc at all. `fast_scalar_power` calls `np.square`. If `square` is not SIMD-optimized, or something else in its implementation diverges, that might cause it to be slower than the power function, which was already baseline SIMD-optimized.
PS. Probably not exactly right, sorry; it should also have hit `square` in `main`! Perhaps some bug around `fast_scalar_power` is causing the operation to run twice, like a failing equality or error check...
Yeah, it's a bit odd. I tracked down the function call stack for both main and your branch, and both of them trigger `FLOAT_square_SSE41`, which has nothing to do with your patch.
Yes, I confirmed that using GDB as well, so the bug I suspected isn't there. The calls should be identical...
Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Apologies for doing the review in pieces, but I have one main concern to address, which was also a problem in #26055. From that PR (see #26055 (comment)):
The issue is that np.power(-float('inf'), .5) and np.sqrt(-float('inf')) have different values.
I also wonder if this was discussed in #27901, because the existing scalar path added in this PR also has this problem. @seberg any thoughts?
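The discrepancy quoted above is easy to demonstrate: IEEE 754 `pow(-inf, 0.5)` is `+inf` (0.5 is positive and not an odd integer), while `sqrt(-inf)` is `nan`, so rewriting `power(x, 0.5)` as `sqrt(x)` changes the result for `x = -inf`:

```python
import numpy as np

# pow(-inf, 0.5) is +inf under IEEE 754 semantics, but sqrt(-inf) is nan;
# a sqrt-based fast path for power(x, 0.5) would change this value.
with np.errstate(invalid="ignore"):
    p = np.power(np.float64(-np.inf), 0.5)
    s = np.sqrt(np.float64(-np.inf))

print(p)  # inf
print(s)  # nan
```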
Thanks, and no worries! It was indeed discussed in that PR: we discussed the possibility to dispatch ... If that works, we could always optimize ... That seems to be out of scope for this patch, but we probably cannot merge this with such a large regression...
numpy/_core/tests/test_umath.py
```python
def test_large_fast_power(self):
    # gh-28248
    for dt in [np.float32, np.float64]:
        a = np.random.uniform(1, 4, 1000).astype(dt)
```
Might be good to pin the random seed/Generator with the usual `default_rng()` approach here?
This is done, thanks!
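A seeded version of the test could look like the sketch below (the function name and tolerances are illustrative, not the exact test in the PR; the idea is just that a pinned `Generator` makes failures reproducible):

```python
import numpy as np

def test_large_fast_power_seeded():
    # Pin the Generator so any failure is reproducible (gh-28248).
    rng = np.random.default_rng(42)
    for dt in [np.float32, np.float64]:
        a = rng.uniform(1, 4, 1000).astype(dt)
        # Check the special-exponent paths against reference results;
        # rtol is loose enough to allow a few float32 ULP of divergence.
        np.testing.assert_allclose(np.power(a, 2.0), a * a, rtol=2e-6)
        np.testing.assert_allclose(np.power(a, 0.5), np.sqrt(a), rtol=2e-6)

test_large_fast_power_seeded()
```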
Could you fix the CI failure on arm? Updating the ULP error to 2 should fix it (for float32, we are usually tolerant up to 3 ULP).
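For tolerances stated this way, `np.testing.assert_array_max_ulp` expresses the bound directly in units in the last place. A sketch with the 3-ULP float32 bound (the arrays here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(1, 4, 1000).astype(np.float32)

# Allow up to 3 ULP of divergence between the pow-based result and
# the elementwise-multiply reference, as is typical for float32.
np.testing.assert_array_max_ulp(np.power(a, 2.0), a * a, maxulp=3)
print("within 3 ULP")
```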
Yes, done.
I think I can do this soon; is it best if I do it in this PR? Thank you!
Closing this for now; happy to reopen if there's renewed interest!
Adds SIMD optimizations to float power fast paths (`np.power`) under AVX512.

This is shared somewhat as a start, since I wondered if this optimization is too granular and complicates things; though I think these paths are common enough to be valuable. It can be cleaned up further once we have optimizations for `np.reciprocal`, `np.square`, etc. Thank you for reviewing!