moving x86-64 feature baseline to SSE4.2? #27851

Open
rgommers opened this issue Nov 25, 2024 · 3 comments · May be fixed by #28896
Labels
component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Comments

@rgommers
Member

As of today, the SIMD "baseline" that we compile for goes up to SSE3, and any higher features are opt-in and runtime dispatched. SSE3 has been the maximum assumed feature for quite a while, and we haven't reviewed this choice recently. At some point in the past we settled on a rule of thumb: we can drop support for systems that lack a particular feature once the share of such systems falls below 0.5%. That seems to be the case now for systems without SSE4.1 and SSE4.2.

Here is the full list of dispatchable targets and the features we currently build for each one. The build log prints each dispatch header followed by its enabled target list, e.g.:

Generating multi-targets for "_umath_tests.dispatch.h" 
  Enabled targets: AVX2, SSE41, baseline

Full set of dispatchable targets:

Generating multi-targets for "_umath_tests.dispatch.h" 
  Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h" 
  Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h" 
  Enabled targets: 
Generating multi-targets for "highway_qsort_16bit.dispatch.h" 
  Enabled targets: 
Generating multi-targets for "loops_arithm_fp.dispatch.h" 
  Enabled targets: AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
Generating multi-targets for "loops_comparison.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE42, baseline
Generating multi-targets for "loops_exponent_log.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX512F, AVX2, baseline
Generating multi-targets for "loops_hyperbolic.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_logical.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_minmax.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_modulo.dispatch.h" 
  Enabled targets: baseline
Generating multi-targets for "loops_trigonometric.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_umath_fp.dispatch.h" 
  Enabled targets: AVX512_SKX, baseline
Generating multi-targets for "loops_unary.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, baseline
Generating multi-targets for "loops_unary_fp.dispatch.h" 
  Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_fp_le.dispatch.h" 
  Enabled targets: SSE41, baseline
Generating multi-targets for "loops_unary_complex.dispatch.h" 
  Enabled targets: AVX512F, AVX2, baseline
Generating multi-targets for "loops_autovec.dispatch.h" 
  Enabled targets: AVX2, baseline
Generating multi-targets for "_simd.dispatch.h" 
  Enabled targets: SSE42, AVX2, FMA3, AVX512F, AVX512_SKX, baseline
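
To see at a glance which headers in a log like the one above currently build a separate SSE4.1 or SSE4.2 target, the log can be parsed with a few lines of Python. This is a minimal sketch, not part of NumPy's build machinery: `parse_targets` and `sse4_targets` are hypothetical helper names, the embedded log is only an excerpt of the list above, and whether a given target would really be dropped after a baseline bump depends on the build configuration.

```python
import re

# Excerpt of the build log quoted above (a subset, for brevity).
LOG = '''\
Generating multi-targets for "_umath_tests.dispatch.h"
  Enabled targets: AVX2, SSE41, baseline
Generating multi-targets for "argfunc.dispatch.h"
  Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_qsort.dispatch.h"
  Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "highway_qsort.dispatch.h"
  Enabled targets:
Generating multi-targets for "loops_unary_fp.dispatch.h"
  Enabled targets: SSE41, baseline
'''

# Targets that would overlap with a baseline of SSE4.2 (assumption: such a
# baseline covers every feature up to and including SSE4.2).
SSE4 = {"SSE41", "SSE42"}


def parse_targets(log):
    """Map each dispatch header to its list of enabled targets."""
    result = {}
    header = None
    for line in log.splitlines():
        m = re.match(r'Generating multi-targets for "([^"]+)"', line)
        if m:
            header = m.group(1)
            continue
        m = re.match(r'\s*Enabled targets:\s*(.*)', line)
        if m and header is not None:
            # An empty list (e.g. highway_qsort) yields no targets.
            result[header] = [t.strip() for t in m.group(1).split(",") if t.strip()]
            header = None
    return result


def sse4_targets(targets):
    """Return the headers that list an SSE4.1 or SSE4.2 dispatch target."""
    return {h: sorted(SSE4 & set(ts)) for h, ts in targets.items() if SSE4 & set(ts)}


if __name__ == "__main__":
    for header, overlap in sse4_targets(parse_targets(LOG)).items():
        print(f"{header}: {', '.join(overlap)}")
```

Feeding the full log into `parse_targets` instead of the excerpt gives the complete set of candidates.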

The most widely used data source for determining what hardware is out there is, I believe, https://store.steampowered.com/hwsurvey/?platform=combined. That currently says that SSE3 is at 100%, SSE4.1 at 99.78% and SSE4.2 at 99.70%. Meaning that if we bump the baseline up to SSE4.2, we'd only be dropping support for ~0.3% of systems with really old CPUs.
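
The comparison against the 0.5% rule of thumb is simple arithmetic, sketched below with the survey figures quoted above. The names `SUPPORT`, `THRESHOLD`, `share_without` and `can_require` are made up for this illustration.

```python
# Steam Hardware Survey figures quoted above (percent of surveyed systems).
SUPPORT = {"SSE3": 100.0, "SSE4.1": 99.78, "SSE4.2": 99.70}

# Rule of thumb from this issue: a feature can join the baseline once fewer
# than 0.5% of systems lack it.
THRESHOLD = 0.5


def share_without(feature):
    """Percent of surveyed systems lacking `feature`, rounded to 2 decimals."""
    return round(100.0 - SUPPORT[feature], 2)


def can_require(feature):
    """True if the share of systems lacking `feature` is below the threshold."""
    return share_without(feature) < THRESHOLD


if __name__ == "__main__":
    for name in SUPPORT:
        print(f"{name}: {share_without(name):.2f}% lack it, can require: {can_require(name)}")
```

With these numbers both SSE4.1 and SSE4.2 fall under the threshold.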

For more context, SSE4.2 was introduced in 2008, and even Windows 11 (version 24H2) now requires it (xref https://en.wikipedia.org/wiki/SSE4#SSE4.2).
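
For anyone wondering whether a particular Linux machine would be affected, the CPU flags in `/proc/cpuinfo` give a quick answer. The sketch below is an illustration only, not a NumPy API: `cpu_flags` and `missing_for_sse42` are hypothetical helpers, and `SAMPLE` is a shortened, made-up flags line. Linux reports SSE3 as `pni`, SSE4.1 as `sse4_1` and SSE4.2 as `sse4_2`.

```python
# Linux flag names for the features discussed in this issue ('pni' is SSE3).
NEEDED = {"pni", "sse4_1", "sse4_2"}

# Shortened, made-up example of the relevant /proc/cpuinfo line.
SAMPLE = "flags\t\t: fpu sse sse2 pni ssse3 sse4_1 sse4_2 popcnt avx avx2"


def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_for_sse42(cpuinfo_text):
    """Return the needed flags that the CPU does not report, sorted by name."""
    return sorted(NEEDED - cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    # On a real Linux system, pass open("/proc/cpuinfo").read() instead.
    print(missing_for_sse42(SAMPLE) or "all needed flags present")
```

An empty result means the machine reports every flag in `NEEDED`.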

Now the other side of this coin is - what do we gain by making this change? I haven't quantified each item, but the basic answer is:

  • Reduces build time on x86-64: 40% of build targets (206/517) on my 6-year-old Intel CPU with AVX512 are SIMD targets. We can trim off a decent fraction of those.
  • Reduces binary size: numpy/_core/_simd.so is currently 3.1 MB out of 39.9 MB on disk for a Linux release build. Looking at the multi-targets list higher up, it looks like we can trim that a fair bit.
  • Reduces the number of variations that should be tested in CI (linux_simd.yml). Given the current config, we can't actually drop a job, but we do improve test coverage (there are currently zero test configs for baseline + SSE4.1/2).

I'd suggest making the change in main this release cycle, meaning for numpy 2.3.0, which will probably be released in June 2025.

Hat tip to @itamarst for bringing up this topic (xref scientific-python/faster-scientific-python-ideas#11).

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Nov 25, 2024
@charris
Member

charris commented Nov 25, 2024

Just a note that my 11-year-old Intel Core i5 supports both SSE4.1 and SSE4.2, and I think of it as a very old CPU. Yes, I will be moving to an AMD Ryzen 5 7600X soon; no machine lasts forever and monitors keep getting bigger.

@seberg
Member

seberg commented Nov 29, 2024

Seems reasonable. This might even help clean up code a bit eventually because IIRC, SSE3 lacked quiet comparisons.
(But, only if relevant compiler versions stopped generating these incorrect instructions for scalar/auto-vectorized code.)

@rgommers
Member Author

An update half a year later: SSE4.2 support improved from 99.70% to 99.78%.
