MAINT: simplify power fast path logic #27901
Conversation
Interesting, thanks for the start! I was worried that `int_arr**-1` may have changed, but it always errored in both versions :).
So this would certainly fix the issue, and it is a very nice amount of code deletion.
I think the main thing we need to do is ensure there is no (big) speed regression in any of the cases that are currently fast-pathed. (That is all the cases for floats, and `** 2` for integers, I think.)
We may already have benchmarks for these (not sure). If so, just running them and showing the result would be sufficient; otherwise, start by trying manually and then we can see whether we should add benchmarks.
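For context, the fast-pathed cases amount to special-casing a handful of exponent values before falling back to the generic power implementation. A minimal pure-Python sketch of that dispatch (the exact set of special-cased exponents is inferred from this thread, not taken from NumPy's C source):

```python
import math

def float_power_fast_path(base, exp):
    """Return (handled, result); handled=False means: use the generic pow.

    Sketch only -- NumPy's real fast paths live in templated C loops.
    """
    if exp == 1.0:
        return True, base              # identity fast path
    if exp == -1.0:
        return True, 1.0 / base        # reciprocal fast path
    if exp == 0.5:
        return True, math.sqrt(base)   # sqrt fast path
    if exp == 2.0:
        return True, base * base       # squaring fast path
    return False, None                 # fall through to generic pow

# Any fast path must agree exactly with the generic implementation:
assert float_power_fast_path(4.0, 0.5) == (True, 2.0)
assert float_power_fast_path(3.0, 2.0) == (True, 9.0)
assert float_power_fast_path(2.0, -1.0) == (True, 0.5)
assert float_power_fast_path(2.0, 3.0) == (False, None)
```

The benchmarking question above is exactly about whether removing branches like these costs measurable time on the hot loops.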
numpy/_core/src/umath/loops.c.src
Outdated
 }
-if (in1 == 1) {
+else if (in1 == 1) {
     *((@type@ *)op1) = 1;
I am a bit confused by the new (and existing) fast paths here. The first branch is the scalar case, and it would seem to me that any fast path is even more relevant there?
Should the fast path just be copied? Maybe there should be a second helper with the loop body, so that we can have `if (stride[1] == 0) { call_helper() } else { call_helper() }` to nudge the compiler into optimizing for the zero-stride case (and assume it will then hoist the checks out of the loop).
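The suggested restructuring can be sketched in pure Python (helper names hypothetical; the real code is templated C): dispatch once on whether the exponent operand has zero stride, i.e. is a scalar, so a compiler can specialize each instantiation of the body and hoist the exponent checks out of the hot loop.

```python
def _power_loop_body(bases, exps, out, scalar_exp):
    # In C this would be one helper called from two branches; a compiler
    # seeing a constant scalar_exp at each call site can lift the exponent
    # checks out of the loop for the zero-stride (scalar) instantiation.
    if scalar_exp:
        exp = exps[0]  # stride 0: every element reuses the same exponent
        for i, b in enumerate(bases):
            out[i] = b ** exp
    else:
        for i, (b, e) in enumerate(zip(bases, exps)):
            out[i] = b ** e

def power_loop(bases, exps, out, exp_stride):
    # The branch suggested in the review: one specialized call per case.
    if exp_stride == 0:
        _power_loop_body(bases, exps, out, scalar_exp=True)
    else:
        _power_loop_body(bases, exps, out, scalar_exp=False)

out = [0.0] * 3
power_loop([1.0, 2.0, 3.0], [2.0], out, exp_stride=0)
assert out == [1.0, 4.0, 9.0]
```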
To be clear, the `if (steps[1] == 0) {` path (sorry, I wrote `stride` above; the ufunc code here calls it "steps") is the path we need to worry about being fast.
I.e., if the second operand is a scalar, then we will always take that path (I am not 100% sure about `scalar**scalar`).
Yes, I stuck to the original fast-path logic, but I agree the scalar case is more important for efficiency. I've created a helper function to remove the repetition.
Thanks for the review :) We do seem to have benchmarks. This is the output without the changes in this PR:
Can we (I suppose we can) trust that the compiler is smart enough to lift that […]? The main difference should be that for […]. Not sure we should be concerned about it, but if anyone is, an exact fast path for common Python integers only to call […]. EDIT: Of course […]
Yes, a surface-level check in […]
Both float fast paths and the array power fast paths are complete! It seems a specific test under […]
The regression test seems to concern norms on object arrays. I have updated the test to consider a larger set of possible exceptions. Perhaps if someone was using […].
Aside from this, my PR should be ready! I will post the complete benchmarks soon.
My bad, just fixed an erroneously declared extra variable in the loops.
Added a check to improve speed when the exponent is a scalar. Here are the benchmarks with the changes:
And without:
I hope you don't mind if there is a bit more iteration. You were so active that I thought I should put in some feedback.
Overall, this looks good. We need to rethink the new fast paths once more, and I would like to double-check what exactly changed in those tests.
numpy/_core/src/multiarray/number.c
Outdated
{
    if (!PyArray_Check(o1)) {
        return -1;
    }
Do we really need this?
It seems removing it causes segmentation faults in the tests. Perhaps the compiler uses it to optimize?
Sorry, I was going to delete it and forgot. If (and only if) the later exact checks match, then the `PyArray_ISOBJECT` can be done safely (although some code paths used to call this with invalid inputs; I don't think they do anymore).
So you can probably remove it, but it would require moving the `PyArray_ISOBJECT` check to the end.
(If it still segfaults, some scalar path calls in here.)
Ah, so that it would be memory-safe in case of the nonzero returns! Thanks, this is done.
numpy/_core/src/multiarray/number.c
Outdated
PyArrayObject *a1 = (PyArrayObject *)o1;
if (PyArray_ISOBJECT(a1)) {
    return -1;
}
Nice! Yes, object is special and we cannot do this (unless we changed the object ufuncs, but I'm not sure that would be right for `**0.5` at least).
One nitpick, though: please don't use `-1` for "not taken". `-1` typically means an error occurred, so I think it is confusing.
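The suggestion can be sketched as distinct return sentinels (names hypothetical, not from the PR), so "fast path not taken" is never conflated with "error occurred", which `-1` conventionally signals in CPython/NumPy C code:

```python
FAST_PATH_TAKEN = 0
FAST_PATH_NOT_TAKEN = 1   # caller falls through to the generic ufunc path
FAST_PATH_ERROR = -1      # a real error: an exception has been set

def try_power_fast_path(is_object_array):
    """Toy dispatcher illustrating the return convention only."""
    if is_object_array:
        # Object arrays must take the generic path, but that is not
        # an error condition, so it must not look like one.
        return FAST_PATH_NOT_TAKEN
    return FAST_PATH_TAKEN

assert try_power_fast_path(True) == FAST_PATH_NOT_TAKEN
assert try_power_fast_path(False) == FAST_PATH_TAKEN
```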
Makes sense! Fixed this.
assert_raises(ValueError, linalg.norm, testvector, ord=-2)
for ord in ['fro', 'nuc', np.inf, -np.inf, 0, -1, -2]:
    pytest.raises(
        (ValueError, ZeroDivisionError, RuntimeWarning),
Can you maybe give an example of the core operation that changed (i.e. what kind of array raised to the power of what kind of scalar this ends up at)?
I would generally not worry about the type of error, but the `RuntimeWarning` is only an error in CI; maybe this gives a warning and then later raises, or maybe not.
I agree those changes are probably fine, but I would feel better seeing an example (I can dig that up myself, but maybe you have it).
It seems that the `norm` function does some casts on the object array:
if not issubclass(x.dtype.type, (inexact, object_)):
    x = x.astype(float)
As far as I can tell, the regression test checks whether this cast works, which is probably why there are cases checking that the arrays are equal and the dtype is `float64`.
With the fast paths, we call `reciprocal` directly, which raises a `ZeroDivisionError` or a `RuntimeWarning` (depending on the environment) instead of a `ValueError`.
I can provide a concrete traceback soon.
Can you provide that traceback/details (I assume it's quick for you)? The `RuntimeWarning` case is the interesting one: if we ignore the `RuntimeWarning`, will we hit a `ValueError` later?
If we don't hit any error on that path, then we need to think about it more carefully.
(And also adjust the test, if only to add `with np.errstate(something="raise"):`.)
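On the errstate point: `np.errstate` is NumPy-specific, but the general pattern at issue — an operation that only warns by default and raises only when the warning is escalated, which CI does — can be sketched with the stdlib `warnings` module (`reciprocal_ish` is a made-up stand-in, not NumPy's `reciprocal`):

```python
import warnings

def reciprocal_ish(x):
    # Mimics IEEE-style behavior: warn on divide-by-zero, return inf.
    if x == 0:
        warnings.warn("divide by zero encountered", RuntimeWarning)
        return float("inf")
    return 1.0 / x

# By default the warning does not interrupt execution:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert reciprocal_ish(0.0) == float("inf")
    assert caught and issubclass(caught[0].category, RuntimeWarning)

# Escalated to an error (as CI does), the same call raises, which is
# what the regression test needs to pin down explicitly:
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        reciprocal_ish(0.0)
    except RuntimeWarning:
        pass
    else:
        raise AssertionError("expected RuntimeWarning to be raised")
```

This is why a bare `RuntimeWarning` in the expected-exception tuple is suspect: outside CI it may never raise at all.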
else if (in2 == 0.5) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = sqrt(in1);
This would need to use a type-specific `sqrt` function. More importantly, though, we won't get all the optimized versions here.
You should probably call `@TYPE@_sqrt` (which includes the loop), but I think that might not dispatch right. @seiko2plus can probably say instantly what to put here to call `sqrt` directly.
Makes sense! I'll try to get `@TYPE@_sqrt` to dispatch right.
It seems that a C type-specific `sqrt` function is assigned on line 229:
* #sqrt = sqrt, sqrtf#
I suppose it is still better for consistency to use the ufunc implementation.
Looking it up, it seems `@TYPE@_sqrt` is indeed difficult to dispatch right without allocating a new array, so we would indeed need another call.
Given that type-specific functions are defined, though, I'm wondering if just pushing all of these conditions into the `BINARY_LOOP_SLIDING` loop and seeing whether the compiler optimizes might be worth the tradeoff?
I just used `@sqrt@` and restructured the loop to include the conditions. It doesn't seem to have hurt performance, and it looks cleaner.
    }
    return;
}
else if (in2 == 2.0) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = in1 * in1;
In fact, here we should maybe also just call the normal multiply loop (with duplicated `op1` pointer).
Will do, along with the sqrt!
numpy/_core/src/umath/loops.c.src
Outdated
BINARY_LOOP_SLIDING {
    @type@ in1 = *(@type@ *)ip1;
    if (_@TYPE@_power_fast_path_helper(in1, in2, (@type@ *)op1) != 0) {
        *((@type@ *)op1) = _@TYPE@_squared_exponentiation_helper(in1, in2start, first_bit);
Hmmm, this can never be taken if we have the outer if, right? Maybe put an `assert(0); /* unreachable */` there to clarify?
(Maybe it was nicer without the outer check, and the compiler would do it anyway, but this is good.)
Actually, I think it can, because of the `in1 != 0` check. The memory-overlap symmetry tests fail when I do not check whether `in1 == 0` before taking the fast paths. It seems zeros play a special role in memory overlap? (We would not need the outer check for efficiency if not for this case.)
Edit: to clarify, the outer check is there for efficiency, as the compiler cannot be relied upon to simplify things with the `in2 != 0` check written the way it is (though perhaps there is a way to rewrite this to improve a lot of it; investigating that).
Okay, so it seems some of the boolean logic was the issue for the symmetry checks. Unfortunately, it doesn't seem like the compiler optimizes either way inside the loop, so I created a boolean variable to track whether any fast-path check has failed. It looks a bit confusing, though it improved the benchmarks.
Update: it seems the compiler can optimize if the check boolean is declared outside of the loop, and with the boolean logic in the helper fixed, we now have a better-looking solution.
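For reference, the `_@TYPE@_squared_exponentiation_helper` in the diff appears to implement exponentiation by squaring (the C version works on fixed-width integers and takes precomputed bit state via `in2start`/`first_bit`, which this generic pure-Python sketch omits):

```python
def pow_by_squaring(base, exp):
    """Compute base**exp for a non-negative integer exp in O(log exp) steps."""
    assert exp >= 0
    result = 1
    while exp:
        if exp & 1:        # lowest exponent bit set: fold in current square
            result *= base
        base *= base        # square the base for the next bit
        exp >>= 1           # move to the next bit of the exponent
    return result

assert pow_by_squaring(3, 0) == 1
assert pow_by_squaring(3, 5) == 243
assert pow_by_squaring(2, 10) == 1024
```

The fast-path helper discussed above then only needs to intercept the trivial exponents before this loop runs.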
numpy/_core/src/multiarray/number.c
Outdated
if (PyLong_CheckExact(o2)) {
    long exp = PyLong_AsLong(o2);
    if (error_converting(exp)) {
        PyErr_Clear();
I think this is fine and I'm happy to keep it as is. In general, a blanket `PyErr_Clear()` has a slight "code smell" (we still have plenty of them in NumPy, though).
This one could be removed with `PyLong_AsLongAndOverflow`, because if an error occurs with that, I think we can assume it is a critical error (e.g. `KeyboardInterrupt`).
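`PyLong_AsLongAndOverflow` reports out-of-range values through a separate overflow flag instead of raising `OverflowError`, so no `PyErr_Clear()` is needed for that case. A rough Python model of that contract (simplified: the real C function returns -1 and sets `*overflow`; the constants here assume a 64-bit `long`):

```python
C_LONG_MAX = 2**63 - 1   # assumption: 64-bit C long
C_LONG_MIN = -(2**63)

def as_long_with_overflow(obj):
    """Return (value, overflow): overflow is +1/-1 when out of range, else 0."""
    if not isinstance(obj, int):
        raise TypeError("expected an int")   # a "real" error, not overflow
    if obj > C_LONG_MAX:
        return 0, 1
    if obj < C_LONG_MIN:
        return 0, -1
    return obj, 0

# In-range values convert; out-of-range values signal overflow without
# setting an exception, so the caller can silently take the generic path:
assert as_long_with_overflow(5) == (5, 0)
assert as_long_with_overflow(2**80) == (0, 1)
assert as_long_with_overflow(-(2**80)) == (0, -1)
```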
Makes sense! Made this change.
No, I don't mind at all! I appreciate the feedback. I am rethinking some of the logic I built to pass the symmetry checks; perhaps that will help clean up the code further.
Made several changes, including adding the reciprocal fast path for floats, though it probably needs to be optimized. Thanks for reviewing.
After running the benchmarks, it seems that the performance decreases could indeed be flukes; on the whole, checking for Python scalars might actually have reduced performance by a slight amount. I removed those checks to reduce the amount of code; if we would still like to have them, I'm happy to add them back. I couldn't find an efficient way to dispatch […].
Here are the compared benchmarks (run with […]):
This should be ready for review! Thank you.
Edit: there seems to be a test failure, with an absolute difference of ~2e-7, in one specific environment. It seems to do with the […]
Nice. I am happy to not have any fast paths. The only worry I have is the fast paths for float arrays getting slower.
Can you check (even just with `%timeit`) the speed of `large_float_arr**0.5` and `large_float_arr**2`?
I suspect we don't have any benchmarks for the operators here (you could add them, of course).
@seiko2plus, if you could give a hand and see whether we can just forward the call to the actual SIMD implementations easily, that would be nice. (It's a bit annoying, because we would have to build new strides/pointers, though...)
So, while I would love to just get rid of the fast paths, I do suspect we need them at least for floats to avoid a regression. (The ufuncs can do it now in theory, but I think it is too much hassle to do in this PR unless we can do it hassle-free enough from inside the inner loop.)
Thanks! That makes sense. Added the benchmarks. Yes, the float operators did get slower:
I've added the fast paths back:
The […]
If that looks fine, I could look into adjusting the test, to ensure the warning does trigger one of the errors!
What happens in the polynomials in the end is fine, maybe. But the underlying issue is not fine. If you look into it, e.g. with `pdb`, […] but that piece of code must raise: […]
Not sure why. Did you add a fast path for reciprocals which are not defined?
Yes, you're right, sorry for the oversight. It seems to be at the scalar level, where I did not disallow the fast paths for integers. I'll fix that.
I have restricted the scalar fast paths to float and complex arrays. The regression test is now reverted to its original state. I'm doing more testing to ensure everything works as expected.
Thanks @MaanasArora, I think we can basically put this in. Although I would like @seiko2plus (or @r-devulap) to have a look at the power changes, because I suspect we can do better, I'll just put it in soon anyway.
The change might even make things worse, but only on `avx512-skx`; otherwise the fast path seems fine to me (whether it changes the results by a tiny bit or not).
OK, let's give this a shot, thanks. It still might be nice to think about how to fall back nicely to the specialized function for the fast paths.
Thank you! Happy to continue helping with this!
This is an initial draft to resolve #27082.
I have removed the fast paths from `array_power` and plan to implement them in the individual ufunc templates, which seem to be few. This prevents the need for scalar extraction and reduces divergence from the ufunc machinery. Currently I have only implemented this for the integer loops, but I would appreciate feedback before proceeding.
I hope I understood the issue correctly. Thank you!