ENH: AVX512 SIMD optimizations for float power fast paths #28248


Closed

Conversation

MaanasArora
Contributor

@MaanasArora MaanasArora commented Jan 29, 2025

Adds SIMD optimizations to float power fast paths (np.power) under AVX512.

I'm sharing this as a starting point, since I wondered whether this optimization is too granular and complicates things; still, I think these fast paths are common enough to be valuable. It can be cleaned up further once we have optimizations for np.reciprocal, np.square, etc.

Thank you for reviewing!

@MaanasArora
Contributor Author

Update: made a few "simplifying" changes that removed some functionality to fix some bugs. Will try to add it back again more carefully.

@MaanasArora MaanasArora changed the title Feature/simd float power fast paths ENH: AVX512 SIMD optimizations for float power fast paths Jan 29, 2025
Member

@r-devulap r-devulap left a comment

The function already has an #ifdef AVX512 block. Can we simplify this patch by moving the SIMD code into that portion? I suggest adding the special case to the simd_pow_f32/f64 functions defined above.

@MaanasArora
Contributor Author

@r-devulap Sure! Maybe we should create a new helper function, though: putting all the zero-stride logic into simd_pow_f32/f64 would require passing in information about the stride being zero in order to use the optimizations, and we probably shouldn't change the argument structure just for an optimization case.
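
As a side note, here is a tiny Python illustration of what the zero-stride case means at the ufunc level (np.broadcast_to is only used to make the stride visible; it is not part of the patch):

```python
import numpy as np

arr = np.linspace(1.0, 4.0, 8, dtype=np.float32)

# A scalar exponent is broadcast against arr, so the ufunc inner loop sees it
# with stride 0 -- this is the "zero-stride" case the fast path targets.
exp = np.broadcast_to(np.float32(0.5), arr.shape)
print(exp.strides)         # (0,)
print(np.power(arr, 0.5))  # matches np.sqrt(arr) for positive inputs
```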

@MaanasArora
Contributor Author

I've pushed a simplification using a helper function. Hope that suffices!

@MaanasArora MaanasArora requested a review from r-devulap February 2, 2025 16:00
Member

@r-devulap r-devulap left a comment

The current patch has a bug. Please add a test that exercises the new code paths; it looks like we don't have coverage for them.

const npy_intp ssrc2 = steps[1] / sizeof(@type@);
const npy_intp sdst = steps[2] / sizeof(@type@);

if (stride_zero) {
Member

The if condition here isn't ideal: it slows down the function even when the input doesn't hit the special paths. After you fix the loop, could you provide benchmark numbers on AVX-512?
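
(A rough way to compare the scalar-exponent fast paths against the array-exponent path, which is the one the extra if condition could slow down, is a quick timeit sketch like the one below; this is only illustrative, not the asv benchmark suite used later.)

```python
import timeit
import numpy as np

a = np.random.default_rng(0).uniform(1, 4, 1_000_000).astype(np.float32)
b = np.full_like(a, 0.5)   # array exponent: does not hit the scalar fast path

for label, fn in [
    ("power(a, 0.5), scalar exponent", lambda: np.power(a, 0.5)),
    ("power(a, 2.0), scalar exponent", lambda: np.power(a, 2.0)),
    ("power(a, b),   array exponent ", lambda: np.power(a, b)),
]:
    t = timeit.timeit(fn, number=50)
    print(f"{label}: {t * 1e3 / 50:.3f} ms per call")
```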

Contributor Author

Sure, will share. I don't have AVX-512 on my laptop, but I can test using Intel's simulator (which NumPy's tests also use) if that would work for performance too.

Member

That is only useful for testing, not benchmarking. Benchmark numbers won't be reliable under an emulator.

Contributor Author

Right, do you have any recommendations for how we can proceed in that case, without supported hardware for benchmarking? Sorry about this.

Member

No worries. I benchmarked this on a TGL (Tiger Lake) machine:

| Change   | Before [79f44308] <main>   | After [9a0fa57f] <pow_fast_paths>   |   Ratio | Benchmark (Parameter)                                               |
|----------|----------------------------|-------------------------------------|---------|---------------------------------------------------------------------|
| +        | 234±2μs                    | 466±10μs                            |    1.99 | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float32'>)      |
| -        | 1.89±0.02ms                | 1.76±0.01ms                         |    0.94 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
| -        | 835±20μs                   | 745±9μs                             |    0.89 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)         |
| -        | 2.30±0.2ms                 | 1.77±0.01ms                         |    0.77 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| -        | 2.18±0ms                   | 1.10±0ms                            |    0.5  | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>)      |
| -        | 628±1μs                    | 249±2μs                             |    0.4  | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)         |
| -        | 1.10±0.01ms                | 286±1μs                             |    0.26 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>)      |

The 2x regression on arr ** 2 is a bit odd. Any idea why that can happen?

Contributor Author

@MaanasArora MaanasArora Feb 18, 2025

Thanks! I suspect it's related to the fact that 2 is an integer scalar, so the call probably takes the fast_scalar_power path rather than promoting the types and hitting the ufunc at all. fast_scalar_power calls np.square; if square is not SIMD-optimized, or something else in its implementation diverges, that could make it slower than the power function, which was already SIMD-optimized as a baseline.

PS: that's probably not quite right, sorry; it should also have hit square on main. Perhaps some bug around fast_scalar_power is causing the operation to run twice, like a failing equality or error check...
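
One way to sanity-check this hypothesis (assuming time_pow_2_op corresponds to arr ** 2 and time_pow_2 to np.power(arr, 2)) is to time the three candidate routes directly:

```python
import timeit
import numpy as np

a = np.random.default_rng(0).uniform(1, 4, 1_000_000).astype(np.float32)

# arr ** 2 with an integer scalar goes through the __pow__ fast-scalar path,
# np.power(a, 2) calls the power ufunc directly, and np.square is what the
# fast-scalar path is expected to delegate to for an exponent of 2.
for label, fn in [
    ("a ** 2        ", lambda: a ** 2),
    ("np.power(a, 2)", lambda: np.power(a, 2)),
    ("np.square(a)  ", lambda: np.square(a)),
]:
    print(label, timeit.timeit(fn, number=50))
```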

Member

Yeah, it's a bit odd. I tracked down the function call stack for both main and your branch, and both of them trigger FLOAT_square_SSE41, which has nothing to do with your patch.

Contributor Author

Yes, I confirmed that using GDB as well, so it isn't the bug I suspected. The calls should be identical...

Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Member

@r-devulap r-devulap left a comment

Apologies for doing the review in pieces, but I have one main concern to address, which was also a problem in #26055. From that PR (see #26055 (comment)):

The issue is that np.power(-float('inf'), .5) and np.sqrt(-float('inf')) have different values.

I also wonder if this was discussed in #27901, because the existing scalar path added in this PR also has this problem. @seberg any thoughts?
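
For reference, the discrepancy from #26055 is easy to reproduce; the values follow the IEEE 754 pow and sqrt semantics implemented by libm:

```python
import numpy as np

neg_inf = np.float64(-np.inf)

# pow(-inf, 0.5) is defined as +inf under IEEE 754 / C99 pow semantics...
print(np.power(neg_inf, 0.5))    # inf

# ...while sqrt(-inf) is an invalid operation and yields nan.
with np.errstate(invalid="ignore"):
    print(np.sqrt(neg_inf))      # nan
```

So a fast path that rewrites power(x, 0.5) as sqrt(x) changes the result for x = -inf.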


@MaanasArora
Contributor Author

MaanasArora commented Feb 18, 2025

Thanks, and no worries! It was indeed discussed in that PR: we considered dispatching to np.sqrt rather than implementing the fast paths with the C intrinsic, but I was having some trouble getting it to dispatch properly.

If that works, we could optimize square and sqrt and then delegate the entire fast path to those functions; that would probably simplify things a lot (and also solve the benchmark regression, if my understanding of the cause is correct)!

That seems out of scope for this patch, but we probably can't merge it with such a large regression...

def test_large_fast_power(self):
    # gh-28248
    for dt in [np.float32, np.float64]:
        a = np.random.uniform(1, 4, 1000).astype(dt)
Contributor

Might be good to pin the random seed/Generator with the usual default_rng() approach here?

Contributor Author

This is done, thanks!
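
For reference, a seeded version of the test could look roughly like the sketch below (the exact exponents, tolerance, and assertions in the committed test may differ):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

def test_large_fast_power(self):
    # gh-28248: exercise the scalar-exponent fast paths of float np.power.
    rng = np.random.default_rng(42)
    for dt in [np.float32, np.float64]:
        a = rng.uniform(1, 4, 1000).astype(dt)
        # The fast paths should agree with square/sqrt to within a couple of ULP.
        assert_array_max_ulp(np.power(a, 2.0), np.square(a), maxulp=2)
        assert_array_max_ulp(np.power(a, 0.5), np.sqrt(a), maxulp=2)
```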

@r-devulap
Member

Could you fix the CI failure on ARM? Updating the ULP error to 2 should fix it (for float32, we are usually tolerant up to 3 ULP).

@MaanasArora
Contributor Author

Yes, done.

@r-devulap r-devulap self-assigned this Mar 3, 2025
@MaanasArora
Contributor Author

MaanasArora commented Mar 24, 2025

If that works, we could optimize square and sqrt and then delegate the entire fast path to those functions; that would probably simplify things a lot

I think I can do this soon; is it best if I do it in this PR? Thank you!

@MaanasArora
Contributor Author

Closing this for now—happy to reopen if there's renewed interest!
