Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682
base: master
Conversation
Force-pushed from fb0f525 to 9de6c4a
Force-pushed from 74599e3 to 4511622
Here's another reproduction of the deadlock that I can confirm this fixes: https://bugs.ruby-lang.org/issues/20346

N = 1000
ractors = N.times.map do
  Ractor.new do
    Ractor.recv # Ractor doesn't start until explicitly told to
    # Do some calculations
    fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
    fib.call(20)
  end
end
threads = ractors.map { |r| Thread.new { r.value } }
ractors.each { |r| r.send(nil) }
threads.each_with_index do |th, i|
  p(i => th.value)
end

On master this often deadlocks with as little as N=10.
ractor_sync.c (Outdated)

@@ -983,7 +983,16 @@ ractor_wakeup_all(rb_ractor_t *r, enum ractor_wakeup_status wakeup_status)
    VM_ASSERT(waiter->wakeup_status == wakeup_none);

    waiter->wakeup_status = wakeup_status;
    rb_ractor_sched_wakeup(r, waiter->th);
#ifdef RUBY_THREAD_PTHREAD_H
How about this?

struct ractor_waiter *waiter;

do {
    RACTOR_LOCK(r);
    {
        waiter = ccan_list_pop(&r->sync.waiters, struct ractor_waiter, node);
        if (waiter) waiter->wakeup_status = wakeup_status;
    }
    RACTOR_UNLOCK(r);

    /* wake the popped waiter outside of the ractor lock */
    if (waiter) rb_ractor_sched_wakeup(r, waiter->th);
} while (waiter);

This patch is essentially the same as yours, but simpler.
Now I'm not sure
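For readers following along, here is a minimal standalone sketch of the pattern this suggestion describes, written with plain pthreads rather than the real ractor_sync.c types (struct waiter, struct waitq, and wakeup_all below are illustrative stand-ins, not CRuby code): pop one waiter while holding the queue lock, release the lock, wake that waiter, and repeat until the queue is empty, so no other lock is ever acquired while the queue lock is held.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins: the real code uses struct ractor_waiter,
 * r->sync.waiters and RACTOR_LOCK/RACTOR_UNLOCK. */
struct waiter {
    struct waiter *next;
    int wakeup_status;       /* 0 plays the role of wakeup_none */
    pthread_cond_t cond;     /* each waiter sleeps on its own condvar */
};

struct waitq {
    pthread_mutex_t lock;    /* plays the role of the ractor lock */
    struct waiter *head;     /* linked list of blocked waiters */
};

/* Wake every waiter: pop one per lock acquisition, then signal it
 * after the lock has been released. */
static void
wakeup_all(struct waitq *q, int wakeup_status)
{
    struct waiter *w;

    do {
        pthread_mutex_lock(&q->lock);
        w = q->head;
        if (w) {
            q->head = w->next;                 /* pop the first waiter */
            w->wakeup_status = wakeup_status;  /* record why it was woken */
        }
        pthread_mutex_unlock(&q->lock);

        if (w) pthread_cond_signal(&w->cond);  /* wake outside the lock */
    } while (w);
}
```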
ractor_sync.c (Outdated)

@@ -1018,10 +1028,18 @@ ubf_ractor_wait(void *ptr)
    waiter->wakeup_status = wakeup_by_interrupt;
    ccan_list_del(&waiter->node);

#ifdef RUBY_THREAD_PTHREAD_H
    should_wake = true;
#else
We can use should_wake for non-pthread builds too, and remove the #ifdef macro.
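As a rough standalone sketch of that shape (plain pthreads, illustrative names only, not the real ubf_ractor_wait): the decision is recorded in should_wake while the lock is held, and the actual wakeup happens after the unlock on every platform, so no RUBY_THREAD_PTHREAD_H branch is needed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-in for one blocked waiter; the real code uses
 * struct ractor_waiter protected by the ractor lock. */
struct waiter {
    pthread_mutex_t *owner_lock;  /* lock guarding wakeup_status and the list */
    pthread_cond_t cond;
    int wakeup_status;            /* 0 = still waiting, 1 = interrupted */
};

/* Unblock one waiter: mutate its state under the lock, wake it afterwards. */
static void
unblock_waiter(struct waiter *w)
{
    bool should_wake = false;

    pthread_mutex_lock(w->owner_lock);
    if (w->wakeup_status == 0) {      /* analogous to wakeup_none */
        w->wakeup_status = 1;         /* analogous to wakeup_by_interrupt */
        should_wake = true;           /* same path for pthread and non-pthread */
    }
    pthread_mutex_unlock(w->owner_lock);

    if (should_wake) pthread_cond_signal(&w->cond);
}
```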
Force-pushed from e1db15a to c1a1ec1
So I made the changes, but there are Windows failures in CI. I think the ractor lock needs to be held during broadcasting because of the Windows implementation.

Edit: I added another commit on top to try to fix the Windows failures, but I'm still getting some. I'm going to move on to other things for now, but I would appreciate it if you could take another look, or if you have any ideas, please share 🙇
Force-pushed from 3aa4a77 to 1e5319e
We were getting Windows timeout failures in CI with the approach of unlocking the ractor lock before wakeup. It's very hard to debug Windows issues, especially because I don't have a computer with Windows installed (I tried for a couple of days), so I changed the waiter side instead. Everything seems to be working well. @ko1 Let me know if this is an acceptable solution for the time being. We plan on adding more assertions for lock ordering to prevent this kind of issue in the future, but that will be done in a separate PR.
Force-pushed from bf6137d to 5f39d4c
Force-pushed from f1bbf64 to 9041218
Force-pushed from 9041218 to 7da4951
Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup()

In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT, in thread_sched_wait_running_turn() we acquire thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can cause a deadlock with rb_ractor_wakeup_all() (ex: port.send(obj)), where we acquire RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).

So the error happens:

nt 1: Ractor.receive
  rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
    - thread_sched_lock(cur_th) (condvar) # acquires lock
    - rb_ractor_lock_self(cr)             # deadlock here: tries to acquire, HANGS

nt 2: port.send
  ractor_wakeup_all()
    - RACTOR_LOCK(port_r) # acquires lock
    - thread_sched_lock   # tries to acquire, HANGS

To fix it, we now unlock the thread_sched_lock before acquiring the ractor_lock in rb_ractor_sched_wait().

Script that reproduces the issue:

```ruby
require "async"

class RactorWrapper
  def initialize
    @ractor = Ractor.new do
      Ractor.recv # Ractor doesn't start until explicitly told to
      # Do some calculations
      fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
      fib.call(20)
    end
  end

  def take_async
    @ractor.send(nil)
    Thread.new { @ractor.value }.value
  end
end

Async do |task|
  10_000.times do |i|
    task.async do
      RactorWrapper.new.take_async
      puts i
    end
  end
end
exit 0
```

Fixes [Bug #21398]

Co-authored-by: John Hawthorn <john.hawthorn@shopify.com>
Force-pushed from 7da4951 to b9a51b2
In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT, in thread_sched_wait_running_turn() we acquire thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can cause a deadlock with rb_ractor_wakeup_all() (ex: port.send(obj)), where we acquire RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).
So, the error happens:
nt 1: Ractor.receive
rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
- thread_sched_lock(cur_th) (condvar) # acquires lock
- rb_ractor_lock_self(cr) # deadlock here: tries to acquire, HANGS
nt 2: port.send
ractor_wakeup_all()
- RACTOR_LOCK(port_r) # acquires lock
- thread_sched_lock # tries to acquire, HANGS
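The inversion above can be reproduced outside of CRuby with two plain pthread mutexes; the sketch below only illustrates the nt 1 / nt 2 interleaving (ractor_lock and sched_lock stand in for RACTOR_LOCK and thread_sched_lock; nothing here is the actual Ruby source).

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RACTOR_LOCK */
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for thread_sched_lock */

/* nt 1 shape: sched lock first, then the ractor lock. */
static void *
waiter_side(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);    /* after the condvar wakeup */
    pthread_mutex_lock(&ractor_lock);   /* hangs if nt 2 already holds ractor_lock */
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}

/* nt 2 shape: ractor lock first, then the sched lock. */
static void *
sender_side(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);
    pthread_mutex_lock(&sched_lock);    /* hangs if nt 1 already holds sched_lock */
    pthread_mutex_unlock(&sched_lock);
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}

int
main(void)
{
    /* Run both sides repeatedly; with opposite lock orders this can deadlock. */
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        pthread_create(&a, NULL, waiter_side, NULL);
        pthread_create(&b, NULL, sender_side, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
    }
    puts("no deadlock this run");
    return 0;
}
```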
One solution would be to rework thread_sched_wait_running_turn() with DNTs. I didn't do this because it would be a bigger architectural change. What I changed is to unlock RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread env. In a non-pthread env it's safe to hold this lock, and we should.

Fixes [Bug #21398]
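For contrast, here is a sketch of the fixed waiter-side shape described in the commit message, using the same stand-in names as the previous sketch: the sched lock is released before the ractor lock is taken, so neither path holds one of the two locks while waiting for the other. (The pthread-only variant described in this paragraph achieves the same thing from the wakeup side by dropping RACTOR_LOCK before calling rb_ractor_sched_wakeup().)

```c
#include <pthread.h>

static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RACTOR_LOCK */
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for thread_sched_lock */

/* Fixed waiter-side shape: never hold sched_lock while acquiring ractor_lock. */
static void *
waiter_side_fixed(void *arg)
{
    (void)arg;

    pthread_mutex_lock(&sched_lock);    /* e.g. after the condvar wakeup */
    /* ... consume scheduler state ... */
    pthread_mutex_unlock(&sched_lock);  /* release before switching lock domains */

    pthread_mutex_lock(&ractor_lock);
    /* ... ractor-side bookkeeping ... */
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}
```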