Measurements of viper GPIO speed on rp2350 #17371

arachsys · 2025-05-27T17:52:17Z

While measuring jitter on hard vs soft IRQs on rp2350 with a scope, I got distracted into benchmarking different ways to do fast GPIO from a viper hard IRQ handler. I found the numbers interesting, so thought I'd post them in case anyone else is interested too.

As a baseline, with the default 150MHz clock frequency, if pin = machine.Pin(0, machine.Pin.OUT), calling pin.value(1) from a viper function takes around 4us with roughly 700ns jitter.

A more direct mem32[0xd0000018] = 1 takes about 2us with c. 500ns jitter.

Of course, viper can write memory directly and this is much faster. If sio is a ptr32 to 0xd0000000, the directly equivalent sio[6] = 1 takes only 14ns.

But there's a little trap here: initialising sio = ptr32(0xd0000000) takes 700ns! The same for sio = ptr32(13 << 28) or sio = ptr32(int(0xd0000000)). However, sio = ptr32(int(13) << 28) or sio = ptr32(int(0xd0) << 24) are fine and take just 26ns.

Even though we're using viper, the argument to ptr32()/int() is a Python integer, and if it needs more than 30 bits we end up dereferencing an object. I understand why this happens but still managed to forget and end up surprised by it!

Similarly, something like ptr8(0)[0xd0000018] won't work at all because we're trying to index zero with an object not a machine int. However, ptr32(0)[0x34000006] = 1 works fine and is fast at 40ns (= 14ns + 26ns).
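
The index in ptr32(0)[0x34000006] is the byte address divided by four, because ptr32 indexes 32-bit words. Here is a minimal host-side sketch of that arithmetic. The helper name ptr32_index is made up for illustration, and FAST_MAX assumes the 30-bit limit mentioned above:

```python
FAST_MAX = 0x3fffffff  # assumed: largest value that fits in 30 bits

def ptr32_index(address):
    """Return the ptr32 word index that reaches a 4-byte-aligned address from base 0."""
    if address % 4:
        raise ValueError("address must be 4-byte aligned")
    return address // 4

index = ptr32_index(0xd0000018)      # the address written in the post above
print(hex(index))                    # 0x34000006
print(index <= FAST_MAX)             # True: the index fits in 30 bits
print(0xd0000018 <= FAST_MAX)        # False: the raw address does not
```

This also shows why the ptr8(0) form fails: a byte index would have to be the full address, which does not fit.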

Final measurement: if we define

@micropython.viper
def wibble():
  pass

then calling wibble() from another viper function costs 1.5us, so it can be quite costly to break up a viper callback in the absence of any way to create a define-time macro.

peterhinch (Collaborator) · 2025-05-27T19:19:21Z

Another approach to passing 32-bit (rather than 30-bit) ints to a Viper or asm function is to populate a 32-bit integer array with the values and pass the array as an arg.
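
A rough sketch of that idea, runnable on the host. The array name BASES is hypothetical, and the viper usage in the trailing comment is an untested assumption about how the array would be consumed on the board:

```python
from array import array

# 'I' is an unsigned int: 4 bytes on MicroPython and on common CPython builds.
BASES = array("I", [0xd0000000, 0xe0000000])  # SIO and PPB bases quoted in this thread

print(BASES.itemsize)   # expected to be 4 on typical platforms
print(hex(BASES[0]))    # the full 32-bit value survives the round trip

# On the board, a viper function could take the array as a ptr32 argument
# (untested sketch, function name is hypothetical):
#
#   @micropython.viper
#   def set_pin(bases: ptr32):
#       sio = ptr32(bases[0])   # read as a machine word, not a Python object
#       sio[6] = 1
```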

arachsys (Author) · 2025-05-28T07:17:06Z

The one I had high hopes might work was

SIO = const(0xd0000000)
@micropython.viper
def callback(_):
    sio = ptr32(SIO)

but even that involves a 700ns lookup. I assume the const substitution happens before the viper decorator even gets a chance to see it?

On the subject of array() and bytearray(), if a = bytearray(16), it takes around 1.4us to do x = ptr32(a) in a viper function but subsequent reads and writes are super-fast: 14ns (2 cycles) for x[1] = 123.

For me, this 100x difference really hammered home the importance of taking local pointers once at the start of a viper function, rather than doing any namespace lookups inside tight loops, as the documentation already cautions.
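
For the x = ptr32(a) case above, this host-side model shows which bytes of the bytearray a word index touches. It uses struct rather than viper and assumes the little-endian layout of the rp2350:

```python
import struct

a = bytearray(16)                        # 16 bytes = four 32-bit words

# Model of x[1] = 123 through a ptr32: word index 1 covers bytes 4..7.
struct.pack_into("<I", a, 1 * 4, 123)    # "<I" = little-endian unsigned 32-bit

print(a[4:8])                            # bytearray(b'{\x00\x00\x00')
print(struct.unpack_from("<I", a, 4)[0]) # 123
```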

I see that

x[0] = 324
x[1] = 324
x[2] = 324

takes 40ns whereas

for i in range(3):
  x[i] = 324

takes 560ns and even

i = int(0)
while i < 3:
    x[i] = 324
    i = i + 1

takes 510ns, although luckily viper still doesn't need to allocate for the range(3). (Within viper, I don't think I need the int(0) there - it would automatically be int32 when assigned as 0?)

It'd be interesting to build a debug version of the code emitters and see what they're actually doing here. I'll need to work out how, though...

bixb922 May 29, 2025

Nice results! Yes, some of the workings of viper can be surprising.

Did you see this page with more information about viper? https://github.com/micropython/micropython/wiki/Improving-performance-with-Viper-code

SIO = const(0xd0000000)

Viper integer constants are in the range -2**29 to 2**29-1. Outside that range, the constants are treated as builtins.int constants and can induce slow behavior unless you use the Viper type cast operators int or uint on the constant. Global const() variables are treated as builtins.int constants. Try SIO = uint(0xd0000000) inside the Viper function, or if SIO is a global const(), use viper_sio = uint(SIO). The Viper type cast operators are essentially of zero cost, similar to a C language type cast.

i = int(0)

This is the same as i = 0, because 0 falls within the range of Viper integer constants.

while i < 3:

I would expect this loop to be much faster. Did you test this in the same function as for i in range(3)? If so, then the variable i is already a builtins.int and the while loop will be much slower than if it were done with a Viper int.

While loops can be even faster if you compute the first and the last pointer and iterate until the pointer reaches the final value. That avoids using a counter. See here https://github.com/micropython/micropython/wiki/Improving-performance-with-Viper-code#range-vs-while

then calling wibble() from another viper function costs 1.5us, so it can be quite costly to break up a Viper callback

That's true. However a call to a Viper function is much faster than a call to a regular function. There are some restrictions: Viper functions do not accept keyword arguments nor optional arguments.

arachsys (Author) · 2025-05-29T17:21:25Z

Yes, I found the documentation and started to dig through the emitnative.c code generator a bit too, although I haven't yet figured out how to build a debug version of mpy-cross that logs the generated assembler from viper so I can experiment with its behaviour more directly. (I rather lazily started disassembling the relevant bytes of the .mpy file, but that's a horrific way to do it!)

My measurement harness is as simple as

from machine import Pin, Timer
import micropython

Pin(0, Pin.OUT)

@micropython.viper
def callback(_):
    sio = ptr32(uint(0xd0) << 24)
    sio[6] = 1
    # Insert something here
    sio[8] = 1

timer = Timer(freq = 1000, mode = Timer.PERIODIC,
    callback = callback, hard = True)

where the block to test is pasted inline. (Obviously it can't use the predefined sio as that would be cheating.)

Without any additions, this produces a 14ns pulse: essentially the two cycles to write sio[8]. I can offset the trigger to measure just the extra time added by the code spliced in.

Inserting x = uint(0xd0000000) adds around 450ns, so is very slow compared to x = uint(0xd0) << 24 at 14ns. The threshold is x = uint(0x3fffffff) which is super-fast (7ns = 1 cycle) whereas x = uint(0x40000000) takes 450ns.

x = int(-0x3fffffff) is also fast but any more negative is slow, which does allow me to rewrite the very slow sio = ptr32(0xd0000000) as a single cycle sio = ptr32(-0x30000000).
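
The negative form is the same 32-bit pattern read as a signed number. This small host-side sketch finds which addresses have such a fast alias. The bounds are the ones measured above, and the helper names are made up for illustration:

```python
FAST_MIN = -0x3fffffff   # most negative literal measured as fast above
FAST_MAX = 0x3fffffff    # largest literal measured as fast above

def is_fast_literal(value):
    """True if the literal falls inside the range measured as fast."""
    return FAST_MIN <= value <= FAST_MAX

def signed32(address):
    """Reinterpret an unsigned 32-bit address as a signed 32-bit integer."""
    address &= 0xffffffff
    return address - (1 << 32) if address & 0x80000000 else address

for addr in (0xd0000000, 0xe0000000, 0x40000000):
    alias = signed32(addr)
    print(hex(addr), hex(alias), is_fast_literal(alias))
# 0xd0000000 -0x30000000 True
# 0xe0000000 -0x20000000 True
# 0x40000000 0x40000000 False
```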

I can't find any way to write the fast version that has the literal 0xd0000000 in it, nor any way to express addresses between 0x40000000 and 0xc0000000 faster/clearer than a shifted smaller constant. If Python had macros or define-time/inline expanded functions, I could write a helper that takes a define-time constant and emits something that viper will optimise well, but sadly Python is not Scheme.
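
One workaround is to generate the source text on the host and paste it into the viper function. The sketch below is only that, with a hypothetical helper name. It emits the shifted and split forms discussed in this thread, and its output has not been timed on hardware beyond those forms:

```python
FAST_MAX = 0x3fffffff  # largest literal measured as fast in this thread

def viper_literal(address):
    """Return source text for a constant expression equal to a 32-bit address."""
    address &= 0xffffffff
    if address <= FAST_MAX:
        return hex(address)
    shift = (address & -address).bit_length() - 1   # number of trailing zero bits
    base = address >> shift
    if base <= FAST_MAX:
        return "uint({}) << {}".format(hex(base), shift)
    # Fall back to a high/low half-word split.
    return "(uint({}) << 16) | {}".format(hex(address >> 16), hex(address & 0xffff))

print(viper_literal(0xd0000000))   # uint(0xd) << 28
print(viper_literal(0x40010000))   # uint(0x4001) << 16
print(viper_literal(0xffffffff))   # (uint(0xffff) << 16) | 0xffff
```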

Yes, I expected the while loop to be super-fast too, and I don't really understand why it's not. With no other code spliced in than:

i = 0
while i < 4:
  i = i + 1

the loop over four values of i doing nothing still costs 390ns or about 59 cycles.

i = 4
while i:
  i = i - 1

is a little bit better at 240ns but still not brilliant: I guess about 36 cycles?
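
For converting the scope readings to cycles, a tiny helper is enough. This sketch assumes the default 150MHz clock mentioned at the start of the thread:

```python
def ns_to_cycles(nanoseconds, freq_hz=150_000_000):
    """Convert a duration in nanoseconds to CPU cycles at the given clock."""
    return nanoseconds * freq_hz / 1_000_000_000

print(ns_to_cycles(14))    # 2.1, i.e. the two-cycle store
print(ns_to_cycles(390))   # 58.5
print(ns_to_cycles(240))   # 36.0
```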

GitHubsSilverBullet May 30, 2025

Here's something a bit more precise. Maybe it's of use for testing:

#!/micropython
# -*- coding: UTF-8 -*-
# vim: fileencoding=utf-8: ts=4: sw=4: expandtab:
#┌────────────────────────────────────────────────────────────────────────────┒
#│      Viper timing measurements                                             ┃
#┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

from machine import freq as cpufreq
from time import sleep_ms

_SIOBASE    = const(0xd000_0000)
_SIOBASEHI  = const(_SIOBASE >> 16)
_SIOBASELO  = const(_SIOBASE & 0xffff)

_PPB_BASE   = const(0xe0000000)
_SYST_CSR   = const(0xe010 >>2)  # SysTick Control and Status Register
_SYST_CVR   = const(0xe018 >>2)  # SysTick Current Value Register

@micropython.viper
def enable_systiming():
    ppb: ptr32 = ptr32(_PPB_BASE)
    ppb[_SYST_CSR] = ( (1<<2)       # CLKSOURCE:Processor clock.
                     | (1<<0) )     # ENABLE: Enable SysTick counter
    print('# All measurement in CPU cycles')

@micropython.viper
def test_timing1():
    HIGH: uint = (1<<24) - 1000
    ppb: ptr32 = ptr32(_PPB_BASE)
    t0: int
    t1: int

    while (t0:=ppb[_SYST_CVR]) < HIGH: pass
    pointer1: ptr32 = ptr32(_SIOBASE)           # unnecessarily slow
    t1 = ppb[_SYST_CVR]
    diff = t0 - t1
    print(f'# {diff:4}', end='   ')
    sleep_ms(100)

    while (t0:=ppb[_SYST_CVR]) < HIGH: pass
    pointer2: ptr32 = ptr32(uint(_SIOBASE))     # pretty bad, timing unstable
    t1 = ppb[_SYST_CVR]
    diff = t0 - t1
    print(f'{diff:4}', end='   ')
    sleep_ms(100)

    while (t0:=ppb[_SYST_CVR]) < HIGH: pass
    pointer3: ptr32 = ptr32((uint(_SIOBASEHI)<<16) | _SIOBASELO)    # fast
    t1 = ppb[_SYST_CVR]
    diff = t0 - t1
    print(f'{diff:4}', end='   ')
    sleep_ms(100)

    while (t0:=ppb[_SYST_CVR]) < HIGH: pass
    x: uint = (uint(_SIOBASEHI)<<16) | _SIOBASELO       # as is this
    pointer4: ptr32 = ptr32(x)
    t1 = ppb[_SYST_CVR]
    diff = t0 - t1
    print(f'{diff:4}' )
    sleep_ms(100)
    return

cpufreq(200_000_000)    # won't change measurements
enable_systiming()
for _ in range(10):
    test_timing1()

# All measurement in CPU cycles
#  184    849     24     28
#  184    345     24     28
#  184    395     24     28
#  184    446     24     28
#  184    347     24     28
#  184    295     24     28
#  184    194     24     28
#  184    346     24     28
#  184    446     24     28
#  184    347     24     28

arachsys (Author) · 2025-05-30T11:14:34Z

Nice, there's basically zero jitter in the overhead on your test harness. Saves firing up a scope!

I added two extra columns, one for pointer5 = ptr32(-0x30000000) inside and one for literally nothing inside to show the fixed overhead that needs subtracting off. Here are the resulting numbers from my board with cpufreq set at 200MHz, as with yours:

# All measurement in CPU cycles
#  114    383     17     19     13     10
#  114    292     17     19     13     10
#  114    440     17     19     13     10
#  114    450     17     19     13     10
#  114    198     17     19     13     10
#  114    439     17     19     13     10
#  114    286     17     19     13     10
#  114    322     17     19     13     10
#  114    443     17     19     13     10
#  114    319     17     19     13     10

PS For me the RVR register comes up as zero so I added ppb[_SYST_RVR] = -1 to get SysTick actually counting.
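
Subtracting the empty-body column from the table above gives the net cost of each variant. A quick host-side check using the first row, with the 200MHz clock of this run (5ns per cycle):

```python
# First row of the table above; the last column is the empty-body overhead.
row = [114, 383, 17, 19, 13, 10]
overhead = row[-1]

net = [cycles - overhead for cycles in row]
print(net)                              # [104, 373, 7, 9, 3, 0]

# Same figures in nanoseconds at 200 MHz.
print([cycles * 5 for cycles in net])   # [520, 1865, 35, 45, 15, 0]
```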

arachsys (Author) · 2025-05-30T11:35:56Z

Here's a slightly boiled down version which disables interrupts during the test:

import machine
import micropython

@micropython.viper
def test() -> uint:
    ppb = ptr32(-0x20000000)      # PPB = 0xe0000000
    ppb[0x3804] = 0b101           # CSR = CLKSOURCE | ENABLE
    ppb[0x3805] = -1              # RVR = maximum

    state = machine.disable_irq()
    t0 = ppb[0x3806]              # t0 = CVR
    t1 = ppb[0x3806]              # t1 = CVR

    x = ptr32(-0x30000000)        # Line to benchmark

    t2 = ppb[0x3806]              # t2 = CVR
    machine.enable_irq(state)

    # Difference between t1 - t2 and t0 - t1 masked with RVR:
    return uint(t1 - t2 - t0 + t1) & ppb[0x3805]

print(*(test() for _ in range(20)))

and some corresponding results for variants you and I have measured above:

x = ptr32(-0x30000000)
# 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

x = int(0xd0000000)
# 839 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68

x = uint(0xd0000000)
# 839 220 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68

x = ptr32(0xd0000000)
# 1065 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104

x = ptr32(int(0xd0000000))
# 839 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68

x = ptr32(0xd0 << 24)
# 1065 221 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104

x = ptr32(int(0xd0) << 24)
# 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

x = ptr32((uint(0xd000) << 16) | 0x0000)
# 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

x = uint((uint(0xd000) << 16) | 0x0000)
y = ptr32(x)
# 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

i = 4
while i:
    i = i - 1
# 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51

i = 0
while i < 4:
    i = i + 1
# 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71 71

When I look at the initially puzzling difference between ptr32(0xd0000000) and int(0xd0000000), it occurs to me that, when passed an object (which 0xd0000000 will be), ptr32() has to decide whether to cast that object by interpreting as a bytearray or as an integer, whereas int() and uint() will assume it's a number. And that'll be why ptr32(int(0xd0000000)) is the same cost as int(0xd0000000) not ptr32(0xd0000000).
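
The return expression in test() above can be checked off-target with a plain-Python model of the counter arithmetic. This is only a sketch: it assumes SysTick is a 24-bit down-counter and that RVR reads back as 0x00ffffff after -1 is written to it. The function name is made up:

```python
MASK = 0x00ffffff   # assumed RVR read-back value: all 24 counter bits set

def net_cycles(t0, t1, t2, mask=MASK):
    """Cycles between the t1 and t2 reads, minus the cost of one read."""
    read_cost = t0 - t1      # two back-to-back reads: pure overhead
    measured = t1 - t2       # overhead plus the benchmarked line
    return (measured - read_cost) & mask

# No wrap: reads are 3 cycles apart and the benchmarked line adds 5 cycles.
print(net_cycles(1000, 997, 989))   # 5

# The counter wraps from 0 back to 0xffffff between t1 and t2.
t2 = (7 - 20) & MASK                # value read 20 cycles after t1 = 7
print(net_cycles(10, 7, t2))        # 17
```

The mask is what keeps the result correct when the down-counter reloads between two reads.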

andrewleech (Collaborator, Sponsor) · 2025-05-30T23:21:04Z

Another curious factor with gpio timing is that:

p = Pin(10)

p(1) # set pin high

is actually noticeably faster than:

p = Pin(10)

p.value(1) # set pin high

This is because in the second case there's a dictionary lookup internally to find the value attribute / function before it can run it.
The first case can just directly run the internal __call__() without a lookup needed, which for a pin ends up running the same gpio function.
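
To illustrate that the two spellings reach the same code, here is a host-side stand-in. FakePin is hypothetical and is not machine.Pin. It shows only that the calls are equivalent, not the speed difference, which comes from MicroPython's internals:

```python
class FakePin:
    """Host-side stand-in for machine.Pin (hypothetical, for illustration only)."""

    def __init__(self):
        self._level = 0

    def value(self, level=None):
        # With no argument, read the level; otherwise set it.
        if level is None:
            return self._level
        self._level = 1 if level else 0

    # Calling the object runs the same function as .value().
    __call__ = value

p = FakePin()
p(1)                 # same effect as p.value(1)
print(p.value())     # 1
p.value(0)
print(p())           # 0
```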

arachsys (Author) · 2025-05-31T07:35:45Z

Interesting, and I measure around 420 cycles for p(1) versus around 700 cycles for p.value(1) (both a bit jittery): almost twice as fast, as you say.

I didn't know you could call pin objects directly like that! I wondered if I'd overlooked it when reading the documentation. The machine.Pin reference does mention __call__ and says "The call method provides a (fast) shortcut [...]" but it isn't shown in the examples, nor in the quick references for pyboard, rp2, esp32, samd51, etc., even though these list a bunch of other aliases.

Maybe I should cook up a docs PR? pin(1) manages to be both more readable and faster than pin.value(1) so surely we ought to advertise it better.

Another fun one I stumbled across: MicroPython interns strings, so comparing two strings in viper is cheap and constant time, as is comparing two pin objects (say). There isn't an id() method on Pin() objects, but if there were, fetching and comparing them would be more expensive than comparing the objects directly... or presumably even comparing their string reprs!

[Edit after reading the code: no, comparing string reprs wouldn't be cheap like comparing objects is, because machine_pin_print() is constructing strings on the fly rather than returning a static interned one. In the general case, the repr contains dynamic info like mode, pull-down, etc. That also means it wouldn't be useful either.]
