ENH: Use BLAS in np.einsum when optimize=False #29071

Open
lysnikolaou opened this issue May 27, 2025 · 0 comments
Proposed new feature or change:

As discussed in a NumPy community meeting some time ago, we propose using BLAS in np.einsum even when optimize=False. To be clear, this does not change the default of optimize (it will remain False); it only changes the underlying behavior when it is False. I've done a lot of benchmarking to understand the performance implications, and it boils down to the following:

  • I've verified, as mentioned in the meeting, that our handwritten C implementation of einsum (PyArray_EinsteinSum) actually outperforms BLAS by a wide margin when np.einsum is called on relatively small arrays. I've written benchmarks that use 1-D, 2-D and 3-D arrays of different sizes. Here are the results:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|--------------------------------------------------|---------|---------------------------------------------------------------|
| +        | 1.68±0.1μs                 | 2.84±0.04μs                                      |    1.69 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small       |
| +        | 10.3±0.2μs                 | 12.0±0.1μs                                       |    1.17 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_small     |
| +        | 9.04±0.1μs                 | 10.5±0.1μs                                       |    1.16 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small       |
| +        | 16.4±0.1μs                 | 18.6±0.2μs                                       |    1.14 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_small |
| -        | 81.6±0.7μs                 | 32.1±0.7μs                                       |    0.39 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big         |
| -        | 547±7μs                    | 42.5±2μs                                         |    0.08 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big       |
| -        | 1.65±0.01ms                | 90.0±1μs                                         |    0.05 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big   |
| -        | 4.73±0.2ms                 | 170±4μs                                          |    0.04 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big         |
| -        | 1.39±0.03s                 | 30.6±0.9ms                                       |    0.02 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big    |
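As context for the numbers above, the two code paths being compared can be exercised directly through the `optimize` flag (real NumPy API). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 300))
b = rng.standard_normal((300, 100))

# optimize=False currently runs the handwritten C loop
# (PyArray_EinsteinSum); optimize=True computes a contraction path
# and dispatches eligible contractions to BLAS-backed routines.
no_opt = np.einsum('ij,jk->ik', a, b, optimize=False)
blas = np.einsum('ij,jk->ik', a, b, optimize=True)

# The proposal only changes which backend runs when optimize=False;
# the numerical result is the same up to floating-point rounding.
assert np.allclose(no_opt, blas)
```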
  • After trying out various array sizes to find a cut-off where BLAS starts to outperform our C implementation, I was able to get it down to roughly 20,000 elements.
  • The next thing I tried was a cut-off where BLAS is only used for arrays with more than 20,000 elements. Unfortunately, that requires parsing the einsum input up front, which is costly. We also end up parsing twice when falling back to the handwritten C version: once in Python to check the array sizes, and once in C inside PyArray_EinsteinSum. It turns out this performs slightly worse than always using BLAS. These are the results:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|--------------------------------------------------|---------|---------------------------------------------------------------|
| +        | 1.58±0.01μs                | 2.78±0.03μs                                      |    1.76 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small       |
| +        | 10.1±0.05μs                | 11.9±0.2μs                                       |    1.18 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_small     |
| +        | 8.91±0.08μs                | 10.5±0.1μs                                       |    1.17 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small       |
| +        | 16.3±0.04μs                | 18.2±0.07μs                                      |    1.12 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_small |
| -        | 80.2±0.7μs                 | 31.6±0.7μs                                       |    0.39 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big         |
| -        | 531±1μs                    | 41.4±0.6μs                                       |    0.08 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big       |
| -        | 1.65±0ms                   | 89.3±2μs                                         |    0.05 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big   |
| -        | 4.53±0.03ms                | 161±2μs                                          |    0.04 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big         |
| -        | 1.29±0.01s                 | 29.8±0.8ms                                       |    0.02 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big    |
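The cut-off variant benchmarked above could be sketched roughly as follows. This is a hypothetical illustration, not NumPy code: `CUTOFF` and `einsum_with_cutoff` are made-up names, and the real implementation would live in C, but the double-parsing problem is the same either way.

```python
import numpy as np

# Roughly where BLAS started to win in the benchmarks above (assumption
# for illustration; the measured cut-off was ~20,000 elements).
CUTOFF = 20_000

def einsum_with_cutoff(subscripts, *operands):
    # Checking sizes requires touching the operands here, and
    # PyArray_EinsteinSum parses them again when the C path is taken --
    # that duplicated work is why this variant was slightly slower.
    operands = [np.asarray(op) for op in operands]
    if max(op.size for op in operands) >= CUTOFF:
        # Large operands: BLAS-backed path.
        return np.einsum(subscripts, *operands, optimize=True)
    # Small operands: handwritten C path.
    return np.einsum(subscripts, *operands, optimize=False)
```

Both branches return the same values; only the backend (and thus the runtime) differs.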
  • There's also a third caveat worth considering. The benchmarks above all use random.randn to generate the input arrays. If the arrays have a specific structure, so that branch prediction hits more often than usual, BLAS performs even worse in comparison: the cut-off becomes larger, and BLAS only wins when the arrays are 3-D. If, for example, we use arange instead of random.randn for the above benchmarks, the results look like this:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                       |
|----------|----------------------------|--------------------------------------------------|---------|-------------------------------------------------------------|
| +        | 2.65±0.01s                 | 7.12±0.01s                                       |    2.69 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big  |
| +        | 8.70±0.6μs                 | 19.5±0.3μs                                       |    2.24 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_medium    |
| +        | 1.42±0.01μs                | 2.66±0.04μs                                      |    1.87 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small     |
| +        | 42.0±0.6μs                 | 54.9±1μs                                         |    1.31 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big       |
| +        | 8.92±0.03ms                | 10.8±0.08ms                                      |    1.21 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big       |
| +        | 10.6±0.06μs                | 12.2±0.06μs                                      |    1.15 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small     |
| +        | 2.94±0.03ms                | 3.35±0.02ms                                      |    1.14 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big |
| -        | 688±4μs                    | 336±3μs                                          |    0.49 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big     |
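For reference, the contraction machinery that optimize=True engages can be inspected with np.einsum_path (real NumPy API); the proposal would route optimize=False calls through the same BLAS-backed path for contractions like the 3-D case above:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64, 64))

# einsum_path returns the chosen contraction order plus a human-readable
# report with estimated FLOP counts and the predicted speedup.
path, report = np.einsum_path('ij,jkl->ikl', a, b, optimize='optimal')
print(path)  # a two-operand contraction yields ['einsum_path', (0, 1)]
```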

So, the upshot of all this is that the performance gain from using BLAS is dramatic, but also fairly sensitive to array sizes and structure. What do people think the best way forward would be?
