MAINT: simplify power fast path logic #27901


Merged (28 commits, Jan 7, 2025)

Conversation

@MaanasArora (Contributor) commented Dec 4, 2024

This is an initial draft to resolve #27082.

I have removed the fast paths from array_power and plan to implement them in the individual ufunc templates, of which there seem to be few. This avoids the need for scalar extraction and reduces divergence from the ufunc. Currently I have only implemented the integer loops, but would appreciate feedback before I proceed.

I hope I understood the issue correctly. Thank you!

@MaanasArora MaanasArora changed the title Maint/simpler power fast paths MAINT: move power fast path logic out of array power to umath loops Dec 4, 2024
@MaanasArora MaanasArora changed the title MAINT: move power fast path logic out of array power to umath loops MAINT: simplify power fast path logic Dec 4, 2024
@seberg (Member) left a comment

Interesting, thanks for the start! I was worried that int_arr**-1 might have changed, but it always errored in both versions :).

So, this would certainly fix the issue, and it deletes a very nice amount of code.

I think the main thing that we need to do is ensure that there is no (big) speed regression with any of the cases that are currently fast-pathed.
(That is all the cases for floats and ** 2 for integers, I think.)

We may already have benchmarks for these (not sure). If so, just running them and showing the results would be sufficient; otherwise, start by trying manually and then we can see whether we should add benchmarks.

```c
}
if (in1 == 1) {
else if (in1 == 1) {
    *((@type@ *)op1) = 1;
```
Member

I am a bit confused by the new (and existing) fast paths here. The first branch is the scalar case, and it would seem to me that any fast path is even more relevant there?
Should the fast path just be copied? Maybe there should be a second helper containing the body, so that we can have if (stride[1] == 0) { call_helper() } else { call_helper() } to nudge the compiler into optimizing the zero-stride case (and assume it will then lift the checks out of the loop).

Member

To be clear, the if (steps[1] == 0) { path (sorry, I wrote stride above; the ufunc code here calls it "steps") is the path we need to worry about being fast.

I.e. if the second operand is a scalar, then we will always take that path (I am not 100% sure about scalar**scalar).

@MaanasArora (Contributor Author) Dec 4, 2024

Yes, I stuck to the original fast path logic but I agree the scalar case is more important for efficiency. I've created a helper function to remove repetition.

@MaanasArora (Contributor Author) commented Dec 4, 2024

Thanks for the review :)

We do seem to have benchmarks (BinaryBench) for powers. I ran it for the cases I have written so far (integers).

This is the output without the changes in this PR:

[85.71%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   9.56±0.03ms 
              numpy.int64   9.52±0.05ms 
             ============= =============

[92.86%] ··· bench_ufunc.BinaryBenchInteger.time_pow_five                    ok
[92.86%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32     3.90±0ms  
              numpy.int64   4.34±0.01ms 
             ============= =============

[100.00%] ··· bench_ufunc.BinaryBenchInteger.time_pow_two                     ok
[100.00%] ··· ============= =============
                  dtype                  
              ------------- -------------
               numpy.int32   3.35±0.05ms 
               numpy.int64    3.64±0.1ms 
             ============= =============

And with changes:

[85.71%] ··· bench_ufunc.BinaryBenchInteger.time_pow                         ok
[85.71%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32    10.9±0.1ms 
              numpy.int64   10.6±0.02ms 
             ============= =============

[92.86%] ··· bench_ufunc.BinaryBenchInteger.time_pow_five                    ok
[92.86%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   3.33±0.05ms 
              numpy.int64    3.67±0.2ms 
             ============= =============

[100.00%] ··· bench_ufunc.BinaryBenchInteger.time_pow_two                     ok
[100.00%] ··· ============= =============
                  dtype                  
              ------------- -------------
               numpy.int32   3.19±0.04ms 
               numpy.int64    3.65±0.2ms 
              ============= =============

I suppose time_pow is somewhat concerning, though I'm not exactly sure why; perhaps it is the scalar checking, since it is the only benchmark where b (the exponent) is an array. It can probably be improved.

@seberg (Member) commented Dec 4, 2024

Can we (I suppose we can) trust that the compiler is smart enough to lift that if (step[1] == 0) out of the loop?
Either way, I don't really think a 5% slowdown is super concerning, and it may also be partially a fluke.

The main difference should be that for **2 the old code didn't have to create np.array(2) to do the operation. So adding these fast-paths sacrifices that.

Not sure we should be concerned about it, but if anyone is, an exact fast path for common Python integers only to call square or reciprocal could still make sense.
(Exact fast path for integers, because that is the normal code and makes the fast-path code very simple.)

EDIT: Of course **0.5 (exact Python float 0.5) matching np.sqrt also makes sense. The point is always the exact type, ignoring even subclasses.

@MaanasArora (Contributor Author) commented:

Yes, a surface-level check in array_power for Python integers and floats (exact type matches) should be a good trade-off, so that the operators don't have to cast to NumPy types. Going to implement that, along with the float fast paths. Thanks.

@MaanasArora (Contributor Author) commented:

Both float fast paths and the array power fast paths are complete!

It seems a specific test under test_regression is breaking. As far as I can tell, because linalg.norm now calls reciprocal on the object arrays during execution, a RuntimeWarning (zero division) is raised instead of a ValueError. Looking at the other occurrences, the other ValueErrors don't seem to be exactly the same either, so I'm assuming I can change the test to expect RuntimeWarning. It does seem inconsistent, but I suppose that indicates a problem with the test logic?

@MaanasArora (Contributor Author) commented Dec 5, 2024

The regression test seems to concern norms of object arrays. I have updated the test to accept a larger set of possible exceptions. If someone were using ValueError to guard against taking norms of object arrays, this might be a concern, though that seems obscure to me.

Aside from this, my PR should be ready! I will post the complete benchmarks soon.

@MaanasArora (Contributor Author) commented Dec 5, 2024

My bad, just fixed an erroneously declared extra variable in the loops.

@MaanasArora (Contributor Author) commented Dec 5, 2024

Added a check to improve speed when the exponent is a scalar.

Here are the benchmarks with changes:

[57.14%] ··· bench_ufunc.BinaryBench.time_atan2                                                                                            ok
[57.14%] ··· =============== ============
                  dtype                  
             --------------- ------------
              numpy.float32   23.7±0.1ms 
              numpy.float64   19.4±0.1ms 
             =============== ============

[64.29%] ··· bench_ufunc.BinaryBench.time_pow                                                                                              ok
[64.29%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   7.40±0.04ms 
              numpy.float64   12.0±0.07ms 
             =============== =============

[71.43%] ··· bench_ufunc.BinaryBench.time_pow_2                                                                                            ok
[71.43%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   1.97±0.01ms 
              numpy.float64   2.14±0.01ms 
             =============== =============

[78.57%] ··· bench_ufunc.BinaryBench.time_pow_half                                                                                         ok
[78.57%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   2.39±0.03ms 
              numpy.float64    2.54±0.3ms 
             =============== =============

[85.71%] ··· bench_ufunc.BinaryBenchInteger.time_pow                                                                                       ok
[85.71%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   10.2±0.06ms 
              numpy.int64   10.2±0.07ms 
             ============= =============

[92.86%] ··· bench_ufunc.BinaryBenchInteger.time_pow_five                                                                                  ok
[92.86%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   3.66±0.02ms 
              numpy.int64    3.93±0.5ms 
             ============= =============

[100.00%] ··· bench_ufunc.BinaryBenchInteger.time_pow_two                                                                                   ok
[100.00%] ··· ============= =============
                  dtype                  
              ------------- -------------
               numpy.int32   3.12±0.01ms 
               numpy.int64    3.37±0.2ms 
              ============= =============

And without:

[57.14%] ··· bench_ufunc.BinaryBench.time_atan2                                                                                            ok
[57.14%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   23.6±0.02ms 
              numpy.float64   19.3±0.03ms 
             =============== =============

[64.29%] ··· bench_ufunc.BinaryBench.time_pow                                                                                              ok
[64.29%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   7.43±0.02ms 
              numpy.float64     11.9±0ms  
             =============== =============

[71.43%] ··· bench_ufunc.BinaryBench.time_pow_2                                                                                            ok
[71.43%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   2.04±0.01ms 
              numpy.float64   1.99±0.01ms 
             =============== =============

[78.57%] ··· bench_ufunc.BinaryBench.time_pow_half                                                                                         ok
[78.57%] ··· =============== =============
                  dtype                   
             --------------- -------------
              numpy.float32   7.41±0.01ms 
              numpy.float64    12.1±0.2ms 
             =============== =============

[85.71%] ··· bench_ufunc.BinaryBenchInteger.time_pow                                                                                       ok
[85.71%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   9.53±0.06ms 
              numpy.int64    9.74±0.3ms 
             ============= =============

[92.86%] ··· bench_ufunc.BinaryBenchInteger.time_pow_five                                                                                  ok
[92.86%] ··· ============= =============
                 dtype                  
             ------------- -------------
              numpy.int32   3.88±0.02ms 
              numpy.int64    3.97±0.2ms 
             ============= =============

[100.00%] ··· bench_ufunc.BinaryBenchInteger.time_pow_two                                                                                   ok
[100.00%] ··· ============= =============
                  dtype                  
              ------------- -------------
               numpy.int32   3.28±0.01ms 
               numpy.int64   3.49±0.01ms 
               ============= =============

@seberg (Member) left a comment

I hope you don't mind if there may be a bit more iteration. You were so active that I thought I should put in some feedback.

Overall, this looks good. We need to rethink the new fast paths once more, and I would like to double-check exactly what changed in those tests.

```c
{
    if (!PyArray_Check(o1)) {
        return -1;
    }
```
Member

Do we really need this?

@MaanasArora (Contributor Author) Dec 5, 2024

It seems removing it causes segmentation faults in the tests. Perhaps the compiler uses it to optimize?

Member

Sorry, I was going to delete it and forgot. If (and only if) the later exact checks match, then the PyArray_ISOBJECT can be done safely (although some code paths used to call this with invalid inputs, I don't think they do anymore).

So you can probably remove it, but it would require re-organizing the PyArray_ISOBJECT check to the end.
(If it still segfaults, some scalar path calls in here.)

Contributor Author

Ah, so that it would be memory-safe in case of the nonzero returns! Thanks--this is done.

```c
PyArrayObject *a1 = (PyArrayObject *)o1;
if (PyArray_ISOBJECT(a1)) {
    return -1;
}
```
Member

Nice! Yes, object is special and we cannot do this (unless we changed the object ufuncs, but I am not sure that would be right for **0.5 at least).

One nitpick, though: Please don't use -1 as "not taken". -1 typically means error occurred, so I think it is confusing.

Contributor Author

Makes sense! Fixed this.

```python
assert_raises(ValueError, linalg.norm, testvector, ord=-2)
for ord in ['fro', 'nuc', np.inf, -np.inf, 0, -1, -2]:
    pytest.raises(
        (ValueError, ZeroDivisionError, RuntimeWarning),
```
Member

Can you maybe say an example of the core operation that changed (i.e. what kind of array to the power of what kind of scalar this ends up at)?
I would generally not worry about the type of error. But the RuntimeWarning is only an error in CI; maybe this gives a warning and then raises later, or maybe not.

I agree those changes are probably fine, but I would feel better seeing an example (I can dig that up myself, but maybe you got it).

@MaanasArora (Contributor Author) Dec 5, 2024

It seems that the norm function does some casts on the object array:

    if not issubclass(x.dtype.type, (inexact, object_)):
        x = x.astype(float)

As far as I can tell, the regression test checks whether this cast works. That is probably why there are checks for the arrays being equal and the dtype being float64.

With the fast paths, we call reciprocal directly, which raises a ZeroDivisionError or RuntimeWarning (depending on deployment) instead of a ValueError.

I can provide a concrete traceback soon.

Member

Can you provide that traceback/details (I assume it's quick for you)? The RuntimeWarning case is the interesting one. If we ignore the RuntimeWarning, will we hit a ValueError later?

If we don't hit any error in that path, then we need to think about it more carefully.
(And also adjust the test, if just to add with np.errstate(something="raise"):).

```c
else if (in2 == 0.5) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = sqrt(in1);
```
Member

This would need to use some type specific sqrt function. More importantly, though, we won't get all the optimized versions here.

You should probably call @TYPE@_sqrt (which includes the loop), but I think that might not dispatch right. @seiko2plus can probably say instantly what to put here to call sqrt directly.

Contributor Author

Makes sense! I'll try to get @TYPE@_sqrt to dispatch right.

@MaanasArora (Contributor Author) Dec 5, 2024

It seems that a C type specific sqrt function is assigned on line 229:

* #sqrt = sqrt, sqrtf#

I suppose it is still better for consistency to use the UFunc implementation.

Contributor Author

On looking it up, it seems @TYPE@_sqrt is indeed difficult to dispatch correctly without allocating a new array, so we would indeed need another call.

Given that type-specific functions are defined, though, I'm wondering whether just pushing all of these conditions into the BINARY_LOOP_SLIDING loop and seeing if the compiler optimizes might be worth the trade-off?

Contributor Author

I just used @sqrt@ and restructured the loop to include the conditions. It doesn't seem to have hurt performance and looks cleaner.

```c
    }
    return;
}
else if (in2 == 2.0) {
    BINARY_LOOP_SLIDING {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = in1 * in1;
```
Member

In fact, here we also should maybe just call the normal multiply loop (with duplicated op1 pointer).

Contributor Author

Will do with the SQRT!

```c
BINARY_LOOP_SLIDING {
    @type@ in1 = *(@type@ *)ip1;
    if (_@TYPE@_power_fast_path_helper(in1, in2, (@type@ *)op1) != 0) {
        *((@type@ *) op1) = _@TYPE@_squared_exponentiation_helper(in1, in2start, first_bit);
```
Member

Hmmm, this can never be taken if we have the outer if, right? Maybe put an assert(0) /* unreachable */ to clarify.

(Maybe it was nicer without the outer check and the compiler will do it anyway, but this is good).

@MaanasArora (Contributor Author) Dec 5, 2024

Actually I think it can, because of the in1 != 0 check. The memory overlap symmetry tests fail when I do not check whether in1 == 0 before taking the fast paths. It seems zeros play a special role in memory overlap? (We would not need the outer check for efficiency if not for this case.)

Edit: to clarify, the outer check is for efficiency, as the compiler cannot be relied upon to simplify with the in2 != 0 check written the way it is (though perhaps there is a way to restructure things to improve a lot of this; investigating that).

Contributor Author

Okay, so it seems some of the boolean logic was the issue for the symmetry checks. Unfortunately, the compiler doesn't seem to optimize either way inside the loop, so I created a boolean variable to track whether any fast-path check has failed. It looks a bit confusing, but it improved the benchmarks.

@MaanasArora (Contributor Author) Dec 6, 2024

Update: it seems the compiler can optimize if the check boolean is declared outside the loop, and now that the boolean logic in the helper is fixed, we have a better-looking solution.

```c
if (PyLong_CheckExact(o2)) {
    long exp = PyLong_AsLong(o2);
    if (error_converting(exp)) {
        PyErr_Clear();
```
Member

I think this is fine and happy to keep it as is. In general blanket PyErr_Clear() has a slight "code smell" (we still have plenty of them in NumPy, though).

This is possible to remove with PyLong_AsLongAndOverflow because if an error occurs with that, I think we can assume it is a critical error (KeyboardInterrupt).

Contributor Author

Makes sense! Made this change.

@MaanasArora (Contributor Author) commented:

No, don't mind at all! I appreciate the feedback.

I am rethinking some of the logic I built to pass the symmetry checks. Perhaps it will help clean up the code further.

@MaanasArora (Contributor Author) commented Dec 6, 2024

Made several changes, including adding the reciprocal fast path for floats, though it probably needs to be optimized. Thanks for reviewing.

@MaanasArora (Contributor Author) commented Dec 10, 2024

After running the benchmarks, it seems the performance decreases could indeed be flukes; on the whole, checking for Python scalars might actually have reduced performance slightly. I removed those checks to reduce the amount of code. If we would still like to have them, I'm happy to add them back.

I couldn't find an efficient way to dispatch @TYPE@_sqrt and the others; they required moving memory around. To be fair, the new fast paths are in the ufuncs for the specific types.

Here are the compared benchmarks (run with asv run -a repeat=50 -a rounds=10 to reduce noise):

| Change   | Before [5047f7be] <master>   | After [b5f8a2be] <maint/simpler-power-fast-paths>   |   Ratio | Benchmark (Parameter)                                               |
|----------|------------------------------|-----------------------------------------------------|---------|---------------------------------------------------------------------|
|          | 17.9±0.3ms                   | 17.7±0.09ms                                         |    0.99 | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float32'>)         |
|          | 14.9±0.1ms                   | 14.8±0.1ms                                          |    0.99 | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float64'>)         |
|          | 4.93±0.2ms                   | 4.96±0.06ms                                         |    1.01 | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float32'>)           |
|          | 10.5±0.5ms                   | 10.5±0.07ms                                         |    1    | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float64'>)           |
|          | 493±30μs                     | 540±20μs                                            |    1.1  | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)         |
| -        | 937±200μs                    | 581±30μs                                            |    0.62 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)         |
| -        | 4.94±0.04ms                  | 802±6μs                                             |    0.16 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>)      |
| -        | 10.4±0.06ms                  | 1.61±0.02ms                                         |    0.15 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>)      |
| +        | 5.27±0.1ms                   | 6.85±0.04ms                                         |    1.3  | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int32'>)      |
| +        | 5.56±0.2ms                   | 6.92±0.1ms                                          |    1.25 | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int64'>)      |
|          | 1.25±0.04ms                  | 1.33±0.01ms                                         |    1.07 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
|          | 1.42±0.2ms                   | 1.36±0.04ms                                         |    0.96 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| +        | 846±50μs                     | 1.16±0.03ms                                         |    1.38 | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int32'>)  |
|          | 1.15±0.2ms                   | 1.17±0.1ms                                          |    1.01 | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int64'>)  |

This should be ready for review! Thank you.

Edit: there seems to be a test failure with an absolute difference of ~2e-7 in one specific environment. It seems to relate to the sqrt implementation, though I'm not sure what to do about it (I doubt it is worth adding back the Python scalar checks for one very specific case; could it partly be a fluke?).

@seberg (Member) left a comment

Nice. I am happy to not have any fast paths. The only worry I have is the fast paths for float arrays getting slower.

Can you check (if just with %timeit) the speed of large_float_arr**0.5 and large_float_arr**2? I suspect we don't have any benchmarks for the operators here (you could add them, of course).

@seiko2plus if you could give a hand and see if we can just forward the call to the actual SIMD implementations easily that would be nice.
(It's a bit annoying, because we have to build new strides/pointers though...)

So, while I would love to just get rid of the fast paths, I do suspect we need them at least for floats to avoid a regression.
(The ufuncs can do it now in theory, but I think it is too much hassle for this PR unless we can do it hassle-free from inside the inner loop.)


@MaanasArora (Contributor Author) commented:

Thanks! That makes sense. Added the benchmarks. Yes, float operators did get slower:

| Change   | Before [5047f7be] <master>   | After [b5f8a2be] <maint/simpler-power-fast-paths>   |   Ratio | Benchmark (Parameter)                                               |
|----------|------------------------------|-----------------------------------------------------|---------|---------------------------------------------------------------------|
|          | 17.7±0.05ms                  | 17.9±0.4ms                                          |    1.01 | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float32'>)         |
|          | 14.9±0.2ms                   | 15.0±0.09ms                                         |    1.01 | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float64'>)         |
|          | 4.92±0.05ms                  | 5.01±0.09ms                                         |    1.02 | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float32'>)           |
|          | 10.5±0.2ms                   | 10.7±0.3ms                                          |    1.01 | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float64'>)           |
| +        | 419±5μs                      | 629±60μs                                            |    1.5  | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)         |
| +        | 472±50μs                     | 1.16±0.5ms                                          |    2.45 | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)         |
| +        | 165±8μs                      | 596±70μs                                            |    3.61 | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float32'>)      |
| +        | 344±5μs                      | 1.04±0.3ms                                          |    3.01 | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float64'>)      |
| -        | 4.88±0.04ms                  | 893±50μs                                            |    0.18 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>)      |
| -        | 10.3±0.05ms                  | 2.19±0.2ms                                          |    0.21 | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>)      |
| +        | 201±0.4μs                    | 915±40μs                                            |    4.55 | bench_ufunc.BinaryBench.time_pow_half_op(<class 'numpy.float32'>)   |
| +        | 806±2μs                      | 1.96±0.3ms                                          |    2.43 | bench_ufunc.BinaryBench.time_pow_half_op(<class 'numpy.float64'>)   |
| +        | 5.12±0.02ms                  | 7.44±0.3ms                                          |    1.45 | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int32'>)      |
| +        | 5.35±0.07ms                  | 7.02±0.3ms                                          |    1.31 | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int64'>)      |
| +        | 1.20±0ms                     | 1.50±0.07ms                                         |    1.25 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
| +        | 1.25±0.02ms                  | 1.70±0.3ms                                          |    1.35 | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| +        | 802±2μs                      | 1.31±0.1ms                                          |    1.63 | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int32'>)  |
| +        | 823±20μs                     | 1.55±0.5ms                                          |    1.88 | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int64'>)  |

I've added the fast paths back:

| Change   | Before [5047f7be] <master>   | After [efbd6b83] <maint/simpler-power-fast-paths>   | Ratio   | Benchmark (Parameter)                                               |
|----------|------------------------------|-----------------------------------------------------|---------|---------------------------------------------------------------------|
|          | 17.9±0.4ms                   | 17.9±0.05ms                                         | 1.00    | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float32'>)         |
|          | 15.0±0.5ms                   | 14.7±0.2ms                                          | 0.98    | bench_ufunc.BinaryBench.time_atan2(<class 'numpy.float64'>)         |
|          | 4.99±0.1ms                   | 4.97±0.04ms                                         | 1.00    | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float32'>)           |
|          | 10.4±0.1ms                   | 10.4±0.07ms                                         | 1.01    | bench_ufunc.BinaryBench.time_pow(<class 'numpy.float64'>)           |
|          | 500±30μs                     | 534±2μs                                             | 1.07    | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float32'>)         |
|          | 836±500μs                    | 557±30μs                                            | ~0.67   | bench_ufunc.BinaryBench.time_pow_2(<class 'numpy.float64'>)         |
|          | 210±50μs                     | 167±2μs                                             | ~0.79   | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float32'>)      |
|          | 404±100μs                    | 347±20μs                                            | ~0.86   | bench_ufunc.BinaryBench.time_pow_2_op(<class 'numpy.float64'>)      |
| -        | 4.96±0.04ms                  | 797±0.9μs                                           | 0.16    | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float32'>)      |
| -        | 10.6±0.1ms                   | 1.60±0ms                                            | 0.15    | bench_ufunc.BinaryBench.time_pow_half(<class 'numpy.float64'>)      |
|          | 202±10μs                     | 201±0.6μs                                           | 1.00    | bench_ufunc.BinaryBench.time_pow_half_op(<class 'numpy.float32'>)   |
|          | 811±50μs                     | 806±2μs                                             | 0.99    | bench_ufunc.BinaryBench.time_pow_half_op(<class 'numpy.float64'>)   |
| +        | 5.18±0.1ms                   | 6.82±0.02ms                                         | 1.32    | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int32'>)      |
| +        | 5.39±0.2ms                   | 6.97±0.1ms                                          | 1.29    | bench_ufunc.BinaryBenchInteger.time_pow(<class 'numpy.int64'>)      |
|          | 1.19±0.04ms                  | 1.33±0ms                                            | ~1.11   | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int32'>) |
|          | 1.26±0.1ms                   | 1.35±0.01ms                                         | 1.07    | bench_ufunc.BinaryBenchInteger.time_pow_five(<class 'numpy.int64'>) |
| +        | 797±20μs                     | 1.13±0.02ms                                         | 1.42    | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int32'>)  |
|          | 895±200μs                    | 1.14±0.01ms                                         | ~1.28   | bench_ufunc.BinaryBenchInteger.time_pow_two(<class 'numpy.int64'>)  |
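For reference, the `time_pow_2` case above can be approximated outside asv with a rough `timeit` micro-benchmark (the array size and repeat count here are assumptions, not the actual benchmark parameters):

```python
import timeit

import numpy as np

# Rough stand-in for bench_ufunc.BinaryBench.time_pow_2: raising a float64
# array to the power 2.0, which the fast path can rewrite as x * x.
x = np.random.rand(100_000)

t_pow = timeit.timeit(lambda: x ** 2.0, number=200)  # goes through np.power
t_mul = timeit.timeit(lambda: x * x, number=200)     # the hand-written fast path

print(f"x ** 2.0: {t_pow:.4f}s, x * x: {t_mul:.4f}s")
```

The two computations produce the same values, so any remaining gap between them is purely the dispatch/loop overhead the fast path is meant to remove.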

The RuntimeWarning case, which occurs when `ord=-1`, does eventually trigger a `ZeroDivisionError` as well:

In [1]: import numpy as np; from numpy import linalg

In [2]: testvector = np.array([np.array([0, 1]), 0, 0], dtype=object)

In [3]: linalg.norm(testvector, ord=-1)
/home/maanas/Documents/open-source/numpy/build-install/usr/lib/python3.12/site-packages/numpy/linalg/_linalg.py:2788: RuntimeWarning: divide by zero encountered in reciprocal
  absx **= ord
/home/maanas/Documents/open-source/numpy/build-install/usr/lib/python3.12/site-packages/numpy/linalg/_linalg.py:2788: RuntimeWarning: invalid value encountered in reciprocal
  absx **= ord
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[3], line 1
----> 1 linalg.norm(testvector, ord=-1)

File ~/Documents/open-source/numpy/build-install/usr/lib/python3.12/site-packages/numpy/linalg/_linalg.py:2788, in norm(x, ord, axis, keepdims)
   2786 else:
   2787     absx = abs(x)
-> 2788     absx **= ord
   2789     ret = add.reduce(absx, axis=axis, keepdims=keepdims)
   2790     ret **= reciprocal(ord, dtype=ret.dtype)

ZeroDivisionError: 0.0 cannot be raised to a negative power

If that looks fine, I could adjust the test to ensure the warning does indeed trigger one of these errors!

@seberg
Member

seberg commented Dec 11, 2024

What ultimately happens in the polynomials may be fine. But the underlying issue is not fine.

If you look into it, e.g. with `%pdb` in IPython, you will see that this kicks in:

>>> np.array([0, 1])**-1
<stdin>:1: RuntimeWarning: divide by zero encountered in reciprocal
<stdin>:1: RuntimeWarning: invalid value encountered in reciprocal
array([9223372036854775807,                   1])  # result is platform dependent

but that piece of code must raise:

ValueError: Integers to negative integer powers are not allowed.

Not sure why; did you add a fast path via reciprocals, which are not defined for integers?
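The required behavior is easy to check against a released NumPy; a minimal reproduction of the contract being described:

```python
import numpy as np

# Integer arrays raised to a negative integer power must raise, never
# silently dispatch to a reciprocal fast path.
try:
    np.array([0, 1]) ** -1
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Float inputs are fine: the reciprocal is well defined (inf for 0.0).
with np.errstate(divide="ignore"):
    print(np.array([0.0, 1.0]) ** -1.0)  # [inf  1.]
```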

@MaanasArora
Contributor Author

MaanasArora commented Dec 11, 2024

Yes, you're right, sorry for the oversight. It happens at the scalar level, where I did not disallow the fast paths for integer inputs. I'll fix that.

@MaanasArora
Contributor Author

MaanasArora commented Dec 11, 2024

I have restricted the scalar fast paths to float and complex arrays, and reverted the regression test to its original state. I'm doing more testing to ensure everything works as expected.
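A hedged sketch of the dtype guard being described (`allows_scalar_fast_path` is a hypothetical helper for illustration only; the real logic lives in NumPy's C source, and the set of fast-pathed exponents shown is an assumption):

```python
import numpy as np

def allows_scalar_fast_path(arr, exponent):
    # Hypothetical helper, not NumPy's actual implementation: scalar
    # exponent fast paths are only taken for inexact (float/complex)
    # dtypes, so integer inputs fall through to the regular power loop,
    # where negative exponents still raise ValueError.
    if not np.issubdtype(arr.dtype, np.inexact):
        return False
    return exponent in (-1.0, 0.0, 0.5, 1.0, 2.0)

print(allows_scalar_fast_path(np.ones(3), 2.0))     # True: float64 input
print(allows_scalar_fast_path(np.arange(3), -1.0))  # False: integer dtype
```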

@seberg seberg requested a review from seiko2plus December 12, 2024 11:06
Member

@seberg seberg left a comment


Thanks @MaanasArora, I think we can basically put this in, and although I would like @seiko2plus (or @r-devulap) to have a look at the power changes because I suspect we can do better, I'll just put it in soon anyway.

The change might even make things worse, but only on avx512-skx, and otherwise the fast-path seems fine to me (whether it changes the results by a tiny bit or not).

@seberg
Member

seberg commented Jan 7, 2025

OK, let's give this a shot, thanks. Still might be nice to think about how to nicely fall back to the specialized function for the fast-paths.

@seberg seberg merged commit 52162af into numpy:main Jan 7, 2025
67 checks passed
@MaanasArora
Contributor Author

Thank you! Happy to continue helping with this!

Successfully merging this pull request may close these issues.

BUG: Power fast-paths have wrong and confusing promotion logic