gh-128942: make arraymodule.c free-thread safe (lock-free) #130771
ping @colesbury
Nice! Disclaimer: I'm not an expert on the FT list implementation, so take some of my comments with a grain of salt.
Seeing good single-threaded performance is nice, but what about multi-threaded scaling? The number of locks that are still here scares me a little. It would be nice if this scaled well for concurrent use as well, especially for operations that don't require concurrent writes (e.g., comparisons and copies).
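One rough way to look at the scaling question is to time read-only operations on a shared array from several threads at once. The sketch below is only an illustration: the helper names `read_only_work` and `time_threads`, the array size, and the loop counts are all assumptions, and it is no substitute for a proper pyperformance run.

```python
import threading
import time
from array import array


def read_only_work(shared, loops):
    # Only reads the shared array: one comparison and one copy per iteration.
    for _ in range(loops):
        _ = shared == shared
        _ = shared[:]


def time_threads(n_threads, shared, loops):
    """Return wall-clock seconds for n_threads running read_only_work at once."""
    threads = [
        threading.Thread(target=read_only_work, args=(shared, loops))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    shared = array("i", range(10_000))
    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): {time_threads(n, shared, 200):.4f}s")
```

On a free-threaded build, if read-only operations do not contend on a lock, the wall-clock time should stay roughly flat as threads are added (up to the number of cores); growth close to linear would suggest serialization.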
Note that this is not ready to go: the memory issue still needs resolving.
@ZeroIntensity you can remove the do-not-merge, it's not an
The main thing here for acceptance is a benchmark run, which I am not able to start (I only did a local pyperformance check against main), so someone with access will have to initiate that to compare with main.
I haven't gotten a chance to look through arraymodule.c yet. I'll review that later this week.
The overall approach here seems good. A few comments below.
The actual
Are there any other places where this needs to take place? Its the test and trying to run it with
Which is not Left the bad
I'd like
Yes,
Use something like: https://gist.github.com/colesbury/96f27e2ddf6b151adeeb4c28ed7554d8 The alignment specifier has to be on
Just a minor nit: "MS_WINDOWS" or "_MSC_VER"? The latter is used in the codebase for MSVC-specific directives, and you can run gcc under Windows.
Excessive QSBR memory usage: I ran across this while profiling memory usage here. The results are similar for both array and list objects which use QSBR to free memory, so this is a QSBR thing. Memory usage numbers for script (provided below) using both array and list (
Script:

import threading
from array import array
from queue import Queue

def thrdfunc(queue):
    while True:
        l = queue.get()
        l.append(0)  # force resize in non-parent thread which will free using _PyMem_ProcessDelayed()

queue = Queue(maxsize=2)
threading.Thread(target=thrdfunc, args=(queue,)).start()

while True:
    l = array('i', [0] * int(3840*2160*3/4))  # using int instead of byte for reasons
    # l = [None] * int(3840*2160*3/8)  # sys.getsizeof(l) ~= 3840*2160*3 bytes
    queue.put(l)

Since delayed memory free checks (and subsequent frees if applicable) occur in one of two situations:
This works great for many small objects, but with larger buffers these can accumulate quickly. I tried a few things but the

diff --git a/Python/pystate.c b/Python/pystate.c
index ee35f0fa945..d9d731a15bc 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -2169,6 +2169,9 @@ _PyThreadState_Attach(PyThreadState *tstate)
 #if defined(Py_DEBUG)
     errno = err;
 #endif
+#ifdef Py_GIL_DISABLED
+    _PyMem_ProcessDelayed(tstate);
+#endif
 }

 static void

Not saying it is THE solution, but at the very least it shows memory usage can be reduced with no hit to performance (timed that with Another option would be to add another check in Thoughts?
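For timing or memory profiling, a variant of the reproduction script that terminates can be handier than the endless loop. The following is a minimal sketch under assumptions: the `STOP` sentinel, the `worker` and `run` names, and the iteration counts are hypothetical and not part of the original script.

```python
import threading
from array import array
from queue import Queue

STOP = None  # hypothetical sentinel telling the worker thread to exit


def worker(queue, counter):
    while True:
        buf = queue.get()
        if buf is STOP:
            break
        buf.append(0)  # append in the non-parent thread, as in the original script
        counter[0] += 1


def run(iterations, n_items):
    """Hand `iterations` freshly built arrays to a worker thread, then stop it."""
    queue = Queue(maxsize=2)
    counter = [0]  # written only by the worker, read only after join()
    thread = threading.Thread(target=worker, args=(queue, counter))
    thread.start()
    for _ in range(iterations):
        queue.put(array("i", [0] * n_items))
    queue.put(STOP)
    thread.join()
    return counter[0]


if __name__ == "__main__":
    # Same buffer size as the original script, but a bounded number of rounds.
    print(run(50, int(3840 * 2160 * 3 / 4)))
```

Running this under an external memory profiler, with and without a candidate change, gives a repeatable comparison because the amount of work is fixed.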
Ping @colesbury, is this still a valid PR or should I close it?
It's valid, just big.
I'm a little worried about the mixing of atomics and critical sections. It's not too clear where to use one or the other.
#endif
}
#endif
    return (*ap->ob_descr->setitem)(data->items, i, v);
Should this be skipped if data == NULL?
Normally yes, this would be a glaring error, but in this case items is not a pointer and this is just pointer arithmetic. The index is also normally validated at this point, so the data must be there except when it is -1 from ins1, in which case setarrayitem doesn't set anything at all and essentially acts as a query on the validity of the data being put. This -1 behavior comes from the original array module.
typedef struct {
    Py_ssize_t allocated;
    _Py_ALIGN_AS(8)
    char items[];
} arraydata;
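The "-1 means validate only" convention described above can be modelled in a few lines of Python. This is a hypothetical model, not the C implementation: the `set_item` name, the use of `struct` for range checking, and the `bytearray` buffer are assumptions made for illustration.

```python
import struct


def set_item(buf, index, value, fmt="i"):
    """Hypothetical model: always validate `value`, store it only if index >= 0.

    Returns 0 on success and -1 if the value does not fit the item type,
    mirroring the C convention of returning -1 on error.
    """
    try:
        packed = struct.pack(fmt, value)  # raises struct.error on a bad value
    except struct.error:
        return -1
    if index >= 0:
        size = struct.calcsize(fmt)
        buf[index * size:(index + 1) * size] = packed
    return 0


if __name__ == "__main__":
    size = struct.calcsize("i")
    buf = bytearray(2 * size)
    print(set_item(buf, -1, 7))      # validation only, buffer is untouched
    print(set_item(buf, 1, 7))       # validates and stores at index 1
    print(set_item(buf, -1, 2**40))  # value does not fit a C int
```

With index -1 the buffer is never dereferenced, which is why, in this model, the call is safe even when no storage exists yet.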
I added lock-free single element reads and writes by mostly copying the list object's homework. TL;DR: pyperformance scimark seems to be back to about what it was without the free-thread safe stuff (pending confirmation of course). Tried a few other things but the list strategy seems good enough (except for the negative index thing I mentioned in #130744, if that is an issue). Timings, the relevant ones are "OLD" - non free-thread safe arraymodule, "SLOW" - the previous slower PR and the last two "LFREERW".
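A quick smoke test for single-element reads racing with resizes might look like the sketch below. It is only an assumption-laden illustration (the `stress` helper and its parameters are hypothetical); passing it does not prove thread safety, and on a GIL build it passes trivially.

```python
import threading
from array import array


def stress(n_readers=4, n_appends=20_000):
    """Read index 0 from several threads while one thread appends."""
    shared = array("i", [0])
    errors = []
    done = threading.Event()

    def reader():
        try:
            while not done.is_set():
                # Index 0 always exists and is never modified by the writer.
                if shared[0] != 0:
                    errors.append("unexpected value")
                    return
        except Exception as exc:
            errors.append(repr(exc))

    def writer():
        try:
            for _ in range(n_appends):
                shared.append(1)  # appends force periodic buffer resizes
        finally:
            done.set()  # always release the readers, even on failure

    threads = [threading.Thread(target=reader) for _ in range(n_readers)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors, len(shared)


if __name__ == "__main__":
    print(stress())
```

On a free-threaded build, running this under a sanitizer is more informative than the returned error list alone.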
array module is not free-thread safe. #128942