core: Add MICROPY_USE_GCC_MUL_OVERFLOW_INTRINSIC. #17754
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #17754 +/- ##
==========================================
- Coverage 98.38% 98.38% -0.01%
==========================================
Files 171 171
Lines 22239 22224 -15
==========================================
- Hits 21880 21865 -15
Misses 359 359
I'm surprised the code size on rv32 is unchanged; I must have the wrong preprocessor check. Locally I found that rp2 (RP2040) is not expected to change, since the new code path is not enabled on Cortex-M0 CPUs.
That's a very nice code size decrease! (I have a MicroPython project running on a very small MCU which has run out of space, even using LTO, and I'll definitely be applying this patch to it.)
And enable it on platforms where I am aware an efficient 32x32->64 bit multiply instruction exists.

Signed-off-by: Jeff Epler <jepler@gmail.com>
Force-pushed from 7d6f557 to 1334ce7.
(note: this should probably end up squashed)

Most MCUs apart from Cortex-M0 with Thumb 1 have an instruction for computing the "high part" of a multiplication (e.g., the upper 32 bits of a 32x32 multiply). When they do, gcc uses this to implement a small and fast overflow check using the __builtin_mul_overflow intrinsic, which is preferable to the guard division method used in smallint.c.

However, the previous mp_small_int_mul_overflow routine checks not only that the result fits within mp_int_t but also that it satisfies SMALL_INT_FITS(), whereas __builtin_mul_overflow only checks for overflow of the C type. As a result, a slight change in the code flow is needed for MP_BINARY_OP_MULTIPLY.

Other sites using mp_small_int_mul_overflow already had the result value flow through to a SMALL_INT_FITS check, so they didn't need any additional changes.

Signed-off-by: Jeff Epler <jepler@gmail.com>
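The gap described in this commit message can be illustrated with a minimal sketch. It assumes a 32-bit word with one tag bit, so that the small-int range is [-2^30, 2^30 - 1]; the `sketch_` functions and `SKETCH_` macros are hypothetical stand-ins, not MicroPython's actual definitions or the code in this PR.

```c
#include <stdbool.h>
#include <stdint.h>

// Assumed small-int range for a 32-bit word with one tag bit (illustrative only).
#define SKETCH_SMALL_INT_MAX ((int32_t)0x3FFFFFFF)
#define SKETCH_SMALL_INT_MIN (-SKETCH_SMALL_INT_MAX - 1)

// Hypothetical equivalent of a "fits in a small int" range check.
static bool sketch_small_int_fits(int32_t v) {
    return v >= SKETCH_SMALL_INT_MIN && v <= SKETCH_SMALL_INT_MAX;
}

// Returns true if x * y cannot be stored as a small int.
// On success (return value false) the product is written to *out.
static bool sketch_mul_needs_bigint(int32_t x, int32_t y, int32_t *out) {
    int32_t product;
    if (__builtin_mul_overflow(x, y, &product)) {
        return true; // the product overflowed the C type
    }
    if (!sketch_small_int_fits(product)) {
        return true; // fits the C type, but lies outside the small-int range
    }
    *out = product;
    return false;
}
```

With these definitions, 32768 * 32768 = 2^30 does not overflow `int32_t`, so the intrinsic alone reports success, yet the product is one past the assumed small-int maximum. That is the case the extra range check has to catch.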
Force-pushed from 1334ce7 to 19000d6.
Any suggestions on how to structure this change better?
Summary
In the discussion of #17734 I became aware that there was some existing use of the builtin overflow intrinsics, particularly for the longlong build.
This PR tests using the multiply intrinsic in place of mp_small_int_mul_overflow.
Testing
I ran the testsuite locally (64-bit standard build). However, I don't know if the testsuite adequately checks multiplications "at the boundary" of the small integer range.
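One way to probe the boundary cases is to cross-check the intrinsic-based test against a widened reference. The sketch below is an assumption-laden illustration, not part of the MicroPython testsuite: it models a 32-bit word with a small-int range of [-2^30, 2^30 - 1], and every `sketch_` and `SKETCH_` name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Assumed small-int range for a 32-bit word with one tag bit (illustrative only).
#define SKETCH_SMALL_MAX ((int32_t)0x3FFFFFFF)
#define SKETCH_SMALL_MIN (-SKETCH_SMALL_MAX - 1)

// Check under test: overflow intrinsic followed by an explicit range check.
static bool sketch_fast_fits(int32_t x, int32_t y) {
    int32_t p;
    return !__builtin_mul_overflow(x, y, &p)
        && p >= SKETCH_SMALL_MIN && p <= SKETCH_SMALL_MAX;
}

// Reference: a 64-bit product cannot overflow for 32-bit operands.
static bool sketch_reference_fits(int32_t x, int32_t y) {
    int64_t p = (int64_t)x * (int64_t)y;
    return p >= SKETCH_SMALL_MIN && p <= SKETCH_SMALL_MAX;
}

// Returns the number of operand pairs on which the two checks disagree.
static size_t sketch_count_boundary_mismatches(void) {
    // Operands chosen so that many products land on or next to a range edge.
    static const int32_t edge[] = {
        0, 1, -1, 2, -2, 32767, 32768, -32768, 32769,
        SKETCH_SMALL_MAX, SKETCH_SMALL_MIN,
        SKETCH_SMALL_MAX / 2, SKETCH_SMALL_MAX / 2 + 1,
        SKETCH_SMALL_MIN / 2, SKETCH_SMALL_MIN / 2 - 1,
    };
    const size_t n = sizeof(edge) / sizeof(edge[0]);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            if (sketch_fast_fits(edge[i], edge[j]) != sketch_reference_fits(edge[i], edge[j])) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Pairs such as (-32768, 32768), whose product is exactly the assumed minimum, and (minimum, -1), whose product is one past the assumed maximum, are the kind of inputs a boundary test would want to include.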
I also did some investigating and found a preprocessor check for riscv, x86/x86_64, and arm that seems to capture the condition "is there a suitable multiply instruction". A check for xtensa is missing but could be beneficial.
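A check of this kind might look roughly like the sketch below. The macro name `SKETCH_HAVE_FAST_MUL_OVERFLOW`, the function `sketch_mul_overflows`, and the exact conditions are assumptions made for illustration; they are not the check used in this PR. The sketch relies on the predefined macros `__i386__`, `__x86_64__`, `__riscv_mul`, `__arm__`, `__thumb__` and `__thumb2__` as provided by GCC-compatible compilers, and falls back to a division-based test elsewhere.

```c
#include <limits.h>
#include <stdbool.h>

// Hypothetical feature macro: 1 where a widening or high-part multiply is
// assumed to exist, 0 otherwise (e.g. Thumb-1-only cores such as Cortex-M0).
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define SKETCH_HAVE_FAST_MUL_OVERFLOW (1)
#elif defined(__GNUC__) && defined(__riscv) && defined(__riscv_mul)
#define SKETCH_HAVE_FAST_MUL_OVERFLOW (1)
#elif defined(__GNUC__) && defined(__arm__) && (!defined(__thumb__) || defined(__thumb2__))
#define SKETCH_HAVE_FAST_MUL_OVERFLOW (1)
#else
#define SKETCH_HAVE_FAST_MUL_OVERFLOW (0)
#endif

// Returns true if x * y overflows long; otherwise stores the product in *out.
static bool sketch_mul_overflows(long x, long y, long *out) {
#if SKETCH_HAVE_FAST_MUL_OVERFLOW
    return __builtin_mul_overflow(x, y, out);
#else
    // Portable fallback: compare against a quotient before multiplying.
    if (x != 0 && y != 0) {
        bool overflow;
        if (x > 0) {
            overflow = (y > 0) ? (x > LONG_MAX / y) : (y < LONG_MIN / x);
        } else {
            overflow = (y > 0) ? (x < LONG_MIN / y) : (y < LONG_MAX / x);
        }
        if (overflow) {
            return true;
        }
    }
    *out = x * y;
    return false;
#endif
}
```

Both branches report only overflow of the C type, so a caller that needs a small-int result would still have to apply its own range check afterwards.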
I think there might be a modest performance benefit (avoiding multiple divisions per multiplication) but I did not attempt to measure it.
Trade-offs and Alternatives
I am not happy with the structure of how this ended up implemented, particularly for the int*int multiply in mp_binary_op. It's more complicated than I would like because mp_small_int_mul_overflow also implicitly checks for SMALL_INT_FITS, while __builtin_mul_overflow just checks whether the C type (e.g., mp_int_t) overflows. However, if/when tests pass and code size comes in smaller, it may be worth looking for an acceptable way to structure the change that still gets the size benefit.