Proposed new feature or change:
As we discussed in the NumPy community meeting some time ago, we want to propose that BLAS be used in `np.einsum`, even when `optimize=False`. To be clear, this does not change the default of `optimize` (it will remain `False`); it only changes the underlying behavior when it is `False`. I've done a lot of benchmarking to understand the performance implications of this. It all boils down to this:
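To illustrate what would change, here is a minimal sketch (the matmul subscript is just an example): today the two calls below take different code paths, and the proposal would let the first one reach BLAS as well.

```python
import numpy as np

a = np.random.randn(200, 200)
b = np.random.randn(200, 200)

# optimize=False currently always uses the handwritten C loop
# (PyArray_EinsteinSum); optimize=True may dispatch the contraction
# to BLAS. The proposal is to allow the BLAS path even when
# optimize=False, without changing the default.
c_loop = np.einsum('ij,jk->ik', a, b, optimize=False)
c_blas = np.einsum('ij,jk->ik', a, b, optimize=True)

assert np.allclose(c_loop, c_blas)
```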
I've verified, as mentioned in the meeting, that our handwritten C implementation of `einsum` (`PyArray_EinsteinSum`) actually outperforms BLAS by quite a distance when calling `np.einsum` on relatively small arrays. I've written benchmarks that use 1-D, 2-D, and 3-D arrays of various sizes. Here are the actual results:
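For context, the kind of harness behind numbers like these might look roughly like the sketch below (the plain matmul contraction and the `timeit` loop are assumptions; the real benchmarks also covered 1-D and 3-D arrays):

```python
import timeit
import numpy as np

# Sweep square 2-D operands of growing size and time both paths.
for n in (8, 32, 128, 512):
    a = np.random.randn(n, n)
    b = np.random.randn(n, n)
    t_c = timeit.timeit(
        lambda: np.einsum('ij,jk->ik', a, b, optimize=False), number=100)
    t_blas = timeit.timeit(
        lambda: np.einsum('ij,jk->ik', a, b, optimize=True), number=100)
    print(f'n={n:4d}  C loop: {t_c:.4f}s  BLAS path: {t_blas:.4f}s')
```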
After trying various array sizes to find the cut-off at which BLAS starts to outperform our C implementation, I was able to get it down to roughly 20,000 elements.
The next thing I tried was implementing a cut-off where BLAS is only enabled for arrays with more than 20,000 elements. Unfortunately, to do that we need to parse the input to `einsum`, which is costly. We also end up parsing twice when falling back to the hand-written C version: once in Python to check the array sizes and once in C inside `PyArray_EinsteinSum`. It turns out this performs slightly worse than always using BLAS. These are the results:
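A minimal sketch of the dispatch idea (the name `einsum_with_cutoff` is hypothetical, and using raw operand sizes stands in for the real, costlier subscript parsing):

```python
import numpy as np

SIZE_CUTOFF = 20_000  # empirical crossover point found above

def einsum_with_cutoff(subscripts, *operands):
    # Hypothetical wrapper, not the actual patch. The real check has
    # to parse the subscripts (the costly step described above); using
    # the raw operand sizes here is a simplification. On the fallback
    # branch the input is then parsed again in C inside
    # PyArray_EinsteinSum.
    if max(op.size for op in operands) > SIZE_CUTOFF:
        return np.einsum(subscripts, *operands, optimize=True)
    return np.einsum(subscripts, *operands, optimize=False)
```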
There's also a third caveat worth considering. The benchmarks above all use `random.randn` to generate the input arrays. If the arrays have a specific structure where branch prediction hits more often than usual, BLAS performs even worse by comparison: the cut-off becomes larger and is crossed only by 3-D arrays. If, for example, we use `arange` rather than `random.randn` for the benchmarks above, the results look like this:
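Concretely, the only difference in the setup is how the operands are generated; a sketch of the two variants for the 3-D case (the shape is illustrative):

```python
import numpy as np

n = 30
# Random inputs, as used in the benchmarks above:
a_rand = np.random.randn(n, n, n)
# Regular, monotonically increasing inputs; with data like this the
# hand-written C loop fares better relative to BLAS, so the cut-off
# moves up and only 3-D cases cross it.
a_reg = np.arange(n ** 3, dtype=np.float64).reshape(n, n, n)
```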
So, the upshot of all this is that the performance gain from using BLAS can be extreme, but it is also quite sensitive to array sizes and structure. What do people think is the best way forward?