ENH: Use BLAS in np.einsum when optimize=False #29071

Open
lysnikolaou opened this issue May 27, 2025 · 0 comments
Proposed new feature or change:

As discussed in a NumPy community meeting some time ago, we propose using BLAS in np.einsum even when optimize=False. To be clear, this does not change the default of optimize (it will remain False); it only changes the underlying behavior when it is False. I've done a lot of benchmarking to understand the performance implications, and it boils down to the following:

  • I've verified, as mentioned in the meeting, that our handwritten C implementation of einsum (PyArray_EinsteinSum) actually outperforms BLAS by a wide margin when np.einsum is called on relatively small arrays. I've written benchmarks that use 1-D, 2-D and 3-D arrays of different sizes. Here are the results:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|--------------------------------------------------|---------|---------------------------------------------------------------|
| +        | 1.68±0.1μs                 | 2.84±0.04μs                                      |    1.69 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small       |
| +        | 10.3±0.2μs                 | 12.0±0.1μs                                       |    1.17 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_small     |
| +        | 9.04±0.1μs                 | 10.5±0.1μs                                       |    1.16 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small       |
| +        | 16.4±0.1μs                 | 18.6±0.2μs                                       |    1.14 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_small |
| -        | 81.6±0.7μs                 | 32.1±0.7μs                                       |    0.39 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big         |
| -        | 547±7μs                    | 42.5±2μs                                         |    0.08 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big       |
| -        | 1.65±0.01ms                | 90.0±1μs                                         |    0.05 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big   |
| -        | 4.73±0.2ms                 | 170±4μs                                          |    0.04 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big         |
| -        | 1.39±0.03s                 | 30.6±0.9ms                                       |    0.02 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big    |
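As context for the numbers above, the two code paths being compared can be exercised directly through the `optimize` flag (real NumPy API). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 300))
b = rng.standard_normal((300, 100))

# optimize=False currently runs the handwritten C loop
# (PyArray_EinsteinSum); optimize=True computes a contraction path
# and dispatches eligible contractions to BLAS-backed routines.
no_opt = np.einsum('ij,jk->ik', a, b, optimize=False)
blas = np.einsum('ij,jk->ik', a, b, optimize=True)

# The proposal only changes which backend runs when optimize=False;
# the numerical result is the same up to floating-point rounding.
assert np.allclose(no_opt, blas)
```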
  • After trying out various array sizes to find a cut-off where BLAS starts to outperform our C implementation, I was able to get it down to roughly 20,000 elements.
  • The next thing I tried was a cut-off where BLAS is only used for arrays with more than 20,000 elements. Unfortunately, that requires parsing the einsum input up front, which is costly. We also end up parsing twice when falling back to the handwritten C version: once in Python to check the array sizes, and once in C inside PyArray_EinsteinSum. It turns out this performs slightly worse than always using BLAS. These are the results:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|--------------------------------------------------|---------|---------------------------------------------------------------|
| +        | 1.58±0.01μs                | 2.78±0.03μs                                      |    1.76 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small       |
| +        | 10.1±0.05μs                | 11.9±0.2μs                                       |    1.18 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_small     |
| +        | 8.91±0.08μs                | 10.5±0.1μs                                       |    1.17 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small       |
| +        | 16.3±0.04μs                | 18.2±0.07μs                                      |    1.12 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_small |
| -        | 80.2±0.7μs                 | 31.6±0.7μs                                       |    0.39 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big         |
| -        | 531±1μs                    | 41.4±0.6μs                                       |    0.08 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big       |
| -        | 1.65±0ms                   | 89.3±2μs                                         |    0.05 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big   |
| -        | 4.53±0.03ms                | 161±2μs                                          |    0.04 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big         |
| -        | 1.29±0.01s                 | 29.8±0.8ms                                       |    0.02 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big    |
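The cut-off variant benchmarked above could be sketched roughly as follows. This is a hypothetical illustration, not NumPy code: `CUTOFF` and `einsum_with_cutoff` are made-up names, and the real implementation would live in C, but the double-parsing problem is the same either way.

```python
import numpy as np

# Roughly where BLAS started to win in the benchmarks above (assumption
# for illustration; the measured cut-off was ~20,000 elements).
CUTOFF = 20_000

def einsum_with_cutoff(subscripts, *operands):
    # Checking sizes requires touching the operands here, and
    # PyArray_EinsteinSum parses them again when the C path is taken --
    # that duplicated work is why this variant was slightly slower.
    operands = [np.asarray(op) for op in operands]
    if max(op.size for op in operands) >= CUTOFF:
        # Large operands: BLAS-backed path.
        return np.einsum(subscripts, *operands, optimize=True)
    # Small operands: handwritten C path.
    return np.einsum(subscripts, *operands, optimize=False)
```

Both branches return the same values; only the backend (and thus the runtime) differs.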
  • There's also a third caveat worth considering. The benchmarks above all use random.randn to generate the input arrays. If the arrays have a specific structure, so that branch prediction hits more often than usual, BLAS performs even worse in comparison: the cut-off becomes larger, and BLAS only wins when the arrays are 3-D. If, for example, we use arange instead of random.randn for the above benchmarks, the results look like this:
| Change   | Before [6fb8dc25] <main>   | After [e0e97b04] <use-blas-even-if-nooptimize>   |   Ratio | Benchmark (Parameter)                                       |
|----------|----------------------------|--------------------------------------------------|---------|-------------------------------------------------------------|
| +        | 2.65±0.01s                 | 7.12±0.01s                                       |    2.69 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_very_big  |
| +        | 8.70±0.6μs                 | 19.5±0.3μs                                       |    2.24 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_medium    |
| +        | 1.42±0.01μs                | 2.66±0.04μs                                      |    1.87 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_small     |
| +        | 42.0±0.6μs                 | 54.9±1μs                                         |    1.31 | bench_linalg.EinsumNoOptimize.time_einsum_one_dim_big       |
| +        | 8.92±0.03ms                | 10.8±0.08ms                                      |    1.21 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_big       |
| +        | 10.6±0.06μs                | 12.2±0.06μs                                      |    1.15 | bench_linalg.EinsumNoOptimize.time_einsum_two_dim_small     |
| +        | 2.94±0.03ms                | 3.35±0.02ms                                      |    1.14 | bench_linalg.EinsumNoOptimize.time_einsum_two_three_dim_big |
| -        | 688±4μs                    | 336±3μs                                          |    0.49 | bench_linalg.EinsumNoOptimize.time_einsum_three_dim_big     |
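For reference, the contraction machinery that optimize=True engages can be inspected with np.einsum_path (real NumPy API); the proposal would route optimize=False calls through the same BLAS-backed path for contractions like the 3-D case above:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64, 64))

# einsum_path returns the chosen contraction order plus a human-readable
# report with estimated FLOP counts and the predicted speedup.
path, report = np.einsum_path('ij,jkl->ikl', a, b, optimize='optimal')
print(path)  # a two-operand contraction yields ['einsum_path', (0, 1)]
```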

So, the upshot of all this is that the performance gain from using BLAS is dramatic, but also fairly sensitive to array sizes and structure. What do people think the best way forward would be?
