ENH: AVX512 SIMD optimizations for float power fast paths #28248


Closed

Conversation

MaanasArora
Contributor

@MaanasArora MaanasArora commented Jan 29, 2025

Adds SIMD optimizations to float power fast paths (np.power) under AVX512.

I'm sharing this as a starting point, since I wondered whether this optimization is too granular and complicates things; still, I think these fast paths are common enough to be valuable. It can be cleaned up further once we have optimizations for np.reciprocal, np.square, etc.

Thank you for reviewing!

@MaanasArora
Contributor Author

Update: made a few "simplifying" changes that removed some functionality to fix some bugs. Will try to add it back again more carefully.

@MaanasArora MaanasArora changed the title Feature/simd float power fast paths ENH: AVX512 SIMD optimizations for float power fast paths Jan 29, 2025
Member

@r-devulap r-devulap left a comment

The function already has an #ifdef AVX512 block. Can we simplify this patch by moving the SIMD code into that portion? I suggest adding the special case to the simd_pow_f32/f64 functions defined above.

@MaanasArora
Contributor Author

@r-devulap Sure! Maybe we should create a new helper function, though: putting all the zero-stride logic into simd_pow_f32/f64 would require passing in information about the stride being zero in order to use the optimizations, and we probably shouldn't change the argument structure just for an optimization case.
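
As a side note, here is a tiny Python illustration of what the zero-stride case means at the ufunc level (np.broadcast_to is only used to make the stride visible; it is not part of the patch):

```python
import numpy as np

arr = np.linspace(1.0, 4.0, 8, dtype=np.float32)

# A scalar exponent is broadcast against arr, so the ufunc inner loop sees it
# with stride 0 -- this is the "zero-stride" case the fast path targets.
exp = np.broadcast_to(np.float32(0.5), arr.shape)
print(exp.strides)         # (0,)
print(np.power(arr, 0.5))  # matches np.sqrt(arr) for positive inputs
```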

@MaanasArora
Contributor Author

I've pushed a simplification using a helper function. Hope that suffices!

@MaanasArora MaanasArora requested a review from r-devulap February 2, 2025 16:00
Member

@r-devulap r-devulap left a comment

The current patch has a bug. Please add a test that exercises the new code paths; it looks like we don't have coverage for them.

const npy_intp ssrc2 = steps[1] / sizeof(@type@);
const npy_intp sdst = steps[2] / sizeof(@type@);

if (stride_zero) {
Member

The if condition here isn't ideal: it slows down the function even when the input doesn't hit the special paths. After you fix the loop, could you provide benchmark numbers on AVX-512?
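
(A rough way to compare the scalar-exponent fast paths against the array-exponent path, which is the one the extra if condition could slow down, is a quick timeit sketch like the one below; this is only illustrative, not the asv benchmark suite used later.)

```python
import timeit
import numpy as np

a = np.random.default_rng(0).uniform(1, 4, 1_000_000).astype(np.float32)
b = np.full_like(a, 0.5)   # array exponent: does not hit the scalar fast path

for label, fn in [
    ("power(a, 0.5), scalar exponent", lambda: np.power(a, 0.5)),
    ("power(a, 2.0), scalar exponent", lambda: np.power(a, 2.0)),
    ("power(a, b),   array exponent ", lambda: np.power(a, b)),
]:
    t = timeit.timeit(fn, number=50)
    print(f"{label}: {t * 1e3 / 50:.3f} ms per call")
```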

Contributor Author

Sure, will share. I don't have AVX-512 on my laptop, but I can test using Intel's simulator (which NumPy's tests also use) if that would work for performance too.

Member

That is only useful for testing, not benchmarking. Benchmark numbers won't be reliable under an emulator.

Contributor Author

Right, do you have any recommendations for how we can proceed in that case, without supported hardware for benchmarking? Sorry about this.

Member

No worries. I benchmarked this on a TGL (Tiger Lake) machine:

| Change   | Before [79f44308] <main>   | After [9a0fa57f] <pow_fast_paths>   |   Ratio | Benchmark (Parameter)                                               |
|----------|----------------------------|-------------------------------------|---------|---------------------------------------------------------------------|
| +        | 234±2μs                    | 466±10μs                            |    1.99 | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float32'>)      |
| -        | 1.89±0.02ms                | 1.76±0.01ms                         |    0.94 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
| -        | 835±20μs                   | 745±9μs                             |    0.89 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)         |
| -        | 2.30±0.2ms                 | 1.77±0.01ms                         |    0.77 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| -        | 2.18±0ms                   | 1.10±0ms                            |    0.5  | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>)      |
| -        | 628±1μs                    | 249±2μs                             |    0.4  | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)         |
| -        | 1.10±0.01ms                | 286±1μs                             |    0.26 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>)      |

The 2x regression on arr ** 2 is a bit odd. Any idea why that can happen?

Contributor Author

@MaanasArora MaanasArora Feb 18, 2025

Thanks! I suspect it's related to the fact that 2 is an integer scalar, so the call probably takes the fast_scalar_power path rather than promoting the types and hitting the ufunc at all. fast_scalar_power calls np.square; if square is not SIMD-optimized, or something else in its implementation diverges, that could make it slower than the power function, which was already SIMD-optimized as a baseline.

PS: that's probably not quite right, sorry; it should also have hit square on main. Perhaps some bug around fast_scalar_power is causing the operation to run twice, like a failing equality or error check...
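
One way to sanity-check this hypothesis (assuming time_pow_2_op corresponds to arr ** 2 and time_pow_2 to np.power(arr, 2)) is to time the three candidate routes directly:

```python
import timeit
import numpy as np

a = np.random.default_rng(0).uniform(1, 4, 1_000_000).astype(np.float32)

# arr ** 2 with an integer scalar goes through the __pow__ fast-scalar path,
# np.power(a, 2) calls the power ufunc directly, and np.square is what the
# fast-scalar path is expected to delegate to for an exponent of 2.
for label, fn in [
    ("a ** 2        ", lambda: a ** 2),
    ("np.power(a, 2)", lambda: np.power(a, 2)),
    ("np.square(a)  ", lambda: np.square(a)),
]:
    print(label, timeit.timeit(fn, number=50))
```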

Member

Yeah, it's a bit odd. I tracked down the function call stack for both main and your branch, and both of them trigger FLOAT_square_SSE41, which has nothing to do with your patch.

Contributor Author

Yes, I confirmed that using GDB as well, so it isn't the bug I suspected. The calls should be identical...

Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
Member

@r-devulap r-devulap left a comment

Apologies for doing the review in pieces, but I have one main concern to address, which was also a problem in #26055. From that PR (see #26055 (comment)):

The issue is that np.power(-float('inf'), .5) and np.sqrt(-float('inf')) have different values.

I also wonder if this was discussed in #27901, because the existing scalar path added in this PR also has this problem. @seberg any thoughts?
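
For reference, the discrepancy from #26055 is easy to reproduce; the values follow the IEEE 754 pow and sqrt semantics implemented by libm:

```python
import numpy as np

neg_inf = np.float64(-np.inf)

# pow(-inf, 0.5) is defined as +inf under IEEE 754 / C99 pow semantics...
print(np.power(neg_inf, 0.5))    # inf

# ...while sqrt(-inf) is an invalid operation and yields nan.
with np.errstate(invalid="ignore"):
    print(np.sqrt(neg_inf))      # nan
```

So a fast path that rewrites power(x, 0.5) as sqrt(x) changes the result for x = -inf.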


@MaanasArora
Contributor Author

MaanasArora commented Feb 18, 2025

Thanks, and no worries! It was indeed discussed in that PR: we considered dispatching to np.sqrt rather than implementing the fast paths with the C intrinsic, but I was having some trouble getting it to dispatch properly.

If that works, we could optimize square and sqrt and then delegate the entire fast path to those functions; that would probably simplify things a lot (and also solve the benchmark regression, if my understanding of the cause is correct)!

That seems out of scope for this patch, but we probably can't merge it with such a large regression...

def test_large_fast_power(self):
    # gh-28248
    for dt in [np.float32, np.float64]:
        a = np.random.uniform(1, 4, 1000).astype(dt)
Contributor

Might be good to pin the random seed/Generator with the usual default_rng() approach here?

Contributor Author

This is done, thanks!
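
For reference, a seeded version of the test could look roughly like the sketch below (the exact exponents, tolerance, and assertions in the committed test may differ):

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

def test_large_fast_power(self):
    # gh-28248: exercise the scalar-exponent fast paths of float np.power.
    rng = np.random.default_rng(42)
    for dt in [np.float32, np.float64]:
        a = rng.uniform(1, 4, 1000).astype(dt)
        # The fast paths should agree with square/sqrt to within a couple of ULP.
        assert_array_max_ulp(np.power(a, 2.0), np.square(a), maxulp=2)
        assert_array_max_ulp(np.power(a, 0.5), np.sqrt(a), maxulp=2)
```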

@r-devulap
Member

Could you fix the CI failure on ARM? Updating the ULP error to 2 should fix it (for float32, we are usually tolerant up to 3 ULP).

@MaanasArora
Contributor Author

Yes, done.

@r-devulap r-devulap self-assigned this Mar 3, 2025
@MaanasArora
Contributor Author

MaanasArora commented Mar 24, 2025

If that works, we could optimize square and sqrt and then delegate the entire fast path to those functions; that would probably simplify things a lot

I think I can do this soon; is it best if I do it in this PR? Thank you!

@MaanasArora
Contributor Author

Closing this for now—happy to reopen if there's renewed interest!
