
gh-109934: notify cancelled futures on thread pool shutdown #134618

Open · wants to merge 8 commits into base: main
Conversation

@duaneg (Contributor) commented May 24, 2025

When `ThreadPoolExecutor` shuts down it cancels any pending futures; however, at present it doesn't notify waiters. Thus their state stays `CANCELLED` instead of `CANCELLED_AND_NOTIFIED`, and any waiters are not awakened.

Call `set_running_or_notify_cancel` on the cancelled futures to fix this.
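
A minimal sketch of the symptom, assuming a pool with a single deliberately blocked worker so the later submissions stay pending (the names and structure here are illustrative, not taken from the patch): because `concurrent.futures.wait` only treats futures in `CANCELLED_AND_NOTIFIED` or `FINISHED` as done, the futures cancelled by `shutdown(cancel_futures=True)` can linger in `not_done` even though `cancelled()` reports `True`.

```python
import concurrent.futures
import threading

def demo():
    gate = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # Occupy the only worker so the remaining submissions stay pending.
        pool.submit(gate.wait)
        pending = [pool.submit(pow, 2, n) for n in range(3)]

        # Cancels the pending futures; without the fix they end up in the
        # CANCELLED state but their waiters are never notified.
        pool.shutdown(wait=False, cancel_futures=True)
        gate.set()  # let the blocked worker finish

    # Without the fix the cancelled futures may remain in `not_done`;
    # with it they are notified and reported as done immediately.
    done, not_done = concurrent.futures.wait(pending, timeout=1)
    print(len(done), len(not_done), [f.cancelled() for f in pending])

if __name__ == "__main__":
    demo()
```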
duaneg added 2 commits July 16, 2025 10:40:
… blocking future has started before checking its status.
@duaneg (Contributor, Author) commented Jul 16, 2025

What on earth possessed me to say "I think I've managed to come up with [a unit test] that works works reliably", I have no idea. Utter foolishness. Oh well, we'll get there.

@chrisvanrun commented Jul 17, 2025

> What on earth possessed me to say "I think I've managed to come up with [a unit test] that works works reliably", I have no idea. Utter foolishness. Oh well, we'll get there.

Been there, done that!

In my local tests I just added a time-based approach that stalls the process for 10 seconds or fails directly, with a generic maximum runtime of 4 s for the test. That is fine for the current project, which also has a 'kill all children' step directly following the shutdown.

I think a good approach would perhaps be to add a generic Lock and have each process stall on it except for one. That one gets picked up in the first 'batch' and then immediately errors out. Then call executor.shutdown() and subsequently release the lock to free up the stalled processes.

You could then assert that the final n tasks are correctly cancelled. I suspect a source of runtime variance here is that the failed task might or might not free up a worker for the next Lock-blocked task; but perhaps the executor blocks while a future's completion is being handled?

@duaneg (Contributor, Author) commented Jul 17, 2025

> I think a good approach would perhaps be to add a generic Lock and have each process stall on it except for one. That one gets picked up in the first 'batch' and then immediately errors out. Then call executor.shutdown() and subsequently release the lock to free up the stalled processes.

Yeah, that is basically what the test does: it submits a bunch of tasks, the first `max_workers` of which immediately block waiting on a barrier, so we know all workers are engaged (and blocked, so they will remain so) and that the remaining tasks are pending and hence will be cancelled. Then it issues the shutdown, then releases the barrier and unblocks the workers.

> You could then assert that the final n tasks are correctly cancelled. I suspect a source of runtime variance here is that the failed task might or might not free up a worker for the next Lock-blocked task; but perhaps the executor blocks while a future's completion is being handled?

This should be reliable, as the shutdown is initiated synchronously while all workers are blocked. By the time any of the tasks complete, the executor must already be shut down and no additional pending tasks will be started.

However, there are lots of tricky details in ensuring this all works robustly. E.g. once the executor is shut down, use of any synchronisation primitives may fail, depending on timing and details of the implementation, so we have to handle `BrokenBarrierError` (both directly in the test and indirectly if the workers hit it).

Also, the internal multiprocessing machinery uses a work queue and considers tasks running (and no longer pending) once they are enqueued, even if they haven't been distributed to workers. It eagerly enqueues some extra tasks over and above the maximum number of workers to keep the pipeline filled. Those tasks will not be cancelled, but may or may not actually run, depending on timing.
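
An illustrative peek at that internal sizing, assuming CPython's current private implementation (`EXTRA_QUEUED_CALLS` is an implementation detail and may change between versions):

```python
from concurrent.futures import process

max_workers = 4
# The process pool's call queue holds a few more items than there are workers,
# and anything already moved into it counts as running and cannot be cancelled
# by shutdown(cancel_futures=True).
call_queue_size = max_workers + process.EXTRA_QUEUED_CALLS
print(call_queue_size)  # 5 with EXTRA_QUEUED_CALLS == 1 at the time of writing
```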

Anyway, hopefully with all of that taken into account the test is now robust and reliable!
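
For reference, a simplified sketch of the thread-pool variant of the structure described above (the counts, the `n_waiting` polling, and all names are illustrative assumptions; the actual test in the PR may be structured differently and also has to cope with the process-pool details discussed above):

```python
import concurrent.futures
import threading
import time

NUM_WORKERS = 4
NUM_PENDING = 4
TIMEOUT = 10

def check_shutdown_notifies_cancelled_futures():
    # The workers plus the main thread all rendezvous at this barrier.
    barrier = threading.Barrier(NUM_WORKERS + 1)

    def blocker():
        try:
            barrier.wait(timeout=TIMEOUT)
        except threading.BrokenBarrierError:
            pass  # timeouts or teardown can break the barrier; just bail out

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS)
    blocked = [pool.submit(blocker) for _ in range(NUM_WORKERS)]
    pending = [pool.submit(blocker) for _ in range(NUM_PENDING)]

    # Wait until every worker is parked on the barrier, so the remaining
    # submissions are guaranteed to still be pending when we shut down.
    while barrier.n_waiting < NUM_WORKERS:
        time.sleep(0.01)

    pool.shutdown(wait=False, cancel_futures=True)
    try:
        barrier.wait(timeout=TIMEOUT)  # release the blocked workers
    except threading.BrokenBarrierError:
        pass

    done, not_done = concurrent.futures.wait(blocked + pending, timeout=TIMEOUT)
    assert all(f.cancelled() for f in pending)
    assert not not_done, "cancelled futures were never notified"

    pool.shutdown(wait=True)

if __name__ == "__main__":
    check_shutdown_notifies_cancelled_futures()
    print("ok")
```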
