
Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682


Open · luke-gruber wants to merge 1 commit into master from bug_21398_ractor_lock_ordering_issue

Conversation

luke-gruber (Contributor)

In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT (dedicated native thread), thread_sched_wait_running_turn() acquires thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can deadlock against rb_ractor_wakeup_all() (ex: port.send(obj)), which acquires RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).

So the deadlock plays out like this:

nt 1: Ractor.receive
  rb_ractor_sched_wait(), after condvar wakeup in thread_sched_wait_running_turn():
    - thread_sched_lock(cur_th) (condvar) # acquires lock
    - rb_ractor_lock_self(cr)             # deadlock here: tries to acquire, HANGS

nt 2: port.send
  ractor_wakeup_all():
    - RACTOR_LOCK(port_r)                 # acquires lock
    - thread_sched_lock                   # tries to acquire, HANGS

One solution would be to rework thread_sched_wait_running_turn() for DNTs. I didn't do that because it would be a bigger architectural change. Instead, I unlock RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread environment. In a non-pthread environment it's safe to hold this lock across the wakeup, and we should.

Fixes [Bug #21398]
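
For readers unfamiliar with the pattern, here is the inversion reduced to a standalone pthreads sketch. The mutex names are illustrative stand-ins for RACTOR_LOCK and thread_sched_lock, not the actual VM structures; running it reliably hangs both threads:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for RACTOR_LOCK(cr) and thread_sched_lock(cur_th). */
static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER;

/* nt 1: wakes up holding the sched lock, then wants the ractor lock. */
static void *
receiver(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);    /* thread_sched_lock(cur_th) */
    usleep(100 * 1000);                 /* widen the race window */
    pthread_mutex_lock(&ractor_lock);   /* rb_ractor_lock_self(cr): HANGS */
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}

/* nt 2: holds the ractor lock, then wants the sched lock. */
static void *
sender(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);   /* RACTOR_LOCK(port_r) */
    usleep(100 * 1000);
    pthread_mutex_lock(&sched_lock);    /* thread_sched_lock: HANGS */
    pthread_mutex_unlock(&sched_lock);
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;
    puts("spawning; expect an AB-BA deadlock...");
    pthread_create(&t1, NULL, receiver, NULL);
    pthread_create(&t2, NULL, sender, NULL);
    pthread_join(t1, NULL);             /* never returns */
    pthread_join(t2, NULL);
    return 0;
}
```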

@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch from fb0f525 to 9de6c4a Compare June 23, 2025 19:17


@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch 2 times, most recently from 74599e3 to 4511622 Compare July 3, 2025 20:31
@jhawthorn (Member)

Here's another reproduction of the deadlock that I can confirm this fixes: https://bugs.ruby-lang.org/issues/20346

```ruby
N = 1000
ractors = N.times.map do
  Ractor.new do
    Ractor.recv # Ractor doesn't start until explicitly told to
    # Do some calculations
    fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
    fib.call(20)
  end
end

threads = ractors.map { |r| Thread.new { r.value } }
ractors.each { |r| r.send(nil) }
threads.each_with_index do |th, i|
  p(i => th.value)
end
```

On master this often deadlocks with as little as N = 10.

ractor_sync.c (outdated)

```diff
@@ -983,7 +983,16 @@ ractor_wakeup_all(rb_ractor_t *r, enum ractor_wakeup_status wakeup_status)
 VM_ASSERT(waiter->wakeup_status == wakeup_none);

 waiter->wakeup_status = wakeup_status;
 rb_ractor_sched_wakeup(r, waiter->th);
 #ifdef RUBY_THREAD_PTHREAD_H
```
Contributor:

How about this?

```c
struct ractor_waiter *waiter;
do {
    RACTOR_LOCK(r);
    {
        waiter = ccan_list_pop(&r->sync.waiters, struct ractor_waiter, node);
        if (waiter) waiter->wakeup_status = wakeup_status;
    }
    RACTOR_UNLOCK(r);
    if (waiter) rb_ractor_sched_wakeup(r, waiter->th);
} while (waiter);
```

This patch is essentially the same as yours, but simpler. Now I'm not sure

ractor_sync.c (outdated)

```diff
@@ -1018,10 +1028,18 @@ ubf_ractor_wait(void *ptr)
 waiter->wakeup_status = wakeup_by_interrupt;
 ccan_list_del(&waiter->node);

 #ifdef RUBY_THREAD_PTHREAD_H
 should_wake = true;
 #else
```
Contributor:

We can use should_wake for non-pthread as well, and remove the #ifdef macro.
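
A hedged sketch of what that suggestion might look like (hypothetical shape, not the real ubf_ractor_wait(); macro and field names follow the ones used elsewhere in this thread): compute should_wake under the ractor lock on every platform and perform the wakeup after unlocking, so the #ifdef disappears.

```c
static void
ubf_ractor_wait_sketch(rb_ractor_t *r, struct ractor_waiter *waiter)
{
    bool should_wake = false;

    RACTOR_LOCK(r);
    {
        if (waiter->wakeup_status == wakeup_none) {
            waiter->wakeup_status = wakeup_by_interrupt;
            ccan_list_del(&waiter->node);
            should_wake = true;  /* same flag on pthread and non-pthread builds */
        }
    }
    RACTOR_UNLOCK(r);

    /* wakeup happens outside the ractor lock on every platform */
    if (should_wake) {
        rb_ractor_sched_wakeup(r, waiter->th);
    }
}
```

As the later comments show, the catch is whether every platform's wakeup primitive tolerates being called without the lock held.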

@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch 2 times, most recently from e1db15a to c1a1ec1 Compare July 8, 2025 19:38
@luke-gruber (Contributor, Author) commented Jul 8, 2025

So I made the changes, but there are Windows failures in CI. I think the ractor lock needs to be held during broadcasting because of the Windows implementation of rb_native_cond_broadcast. I changed it to rb_native_cond_signal, which should be thread-safe as long as two NTs don't call it at the same time, which is currently not possible in this scenario.

Edit: I added another commit on top to try to fix the Windows failures, but I'm still getting some. I'm going to move on to other things for now; I would appreciate it if you could take another look, or if you have any ideas, please share 🙇
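
For context on why holding the lock matters here, the standard condition-variable discipline looks like this (a plain pthreads sketch, not Ruby's rb_native_cond_* code): the predicate is updated under the mutex, and on some ports the broadcast/signal itself must also happen while the mutex is held, otherwise a waiter can miss the wakeup.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static bool woken = false;            /* the predicate */

static void
wait_side(void)
{
    pthread_mutex_lock(&m);
    while (!woken) {                  /* loop guards against spurious wakeups */
        pthread_cond_wait(&cv, &m);   /* atomically releases m while sleeping */
    }
    pthread_mutex_unlock(&m);
}

static void
wake_side(void)
{
    pthread_mutex_lock(&m);
    woken = true;                     /* update the predicate under the mutex */
    pthread_cond_broadcast(&cv);      /* some implementations need m held here */
    pthread_mutex_unlock(&m);
}
```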

@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch 11 times, most recently from 3aa4a77 to 1e5319e Compare July 11, 2025 21:12
@luke-gruber (Contributor, Author) commented Jul 11, 2025

We were getting Windows timeout failures in CI with the approach of unlocking the ractor lock before wakeup. Windows issues are very hard to debug, especially since I don't have a computer with Windows installed (I tried for a couple of days), so I changed the waiter side instead. Everything seems to be working well. @ko1 Let me know if this is an acceptable solution for the time being. We plan to add more assertions for lock ordering to prevent this kind of issue in the future, but that will be done in a separate PR.
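
In outline, the waiter-side change makes every path take the two locks in the same order. Continuing the standalone pthreads sketch from the PR description above (still illustrative names, not the literal patch), the fixed receiver drops the inner lock before taking the ractor lock:

```c
/* Fixed nt 1: release sched_lock before taking ractor_lock, so every
 * thread acquires ractor_lock -> sched_lock in one global order and the
 * AB-BA cycle from the sketch above cannot form. */
static void *
receiver_fixed(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);
    /* ... consume the condvar wakeup here ... */
    pthread_mutex_unlock(&sched_lock);  /* drop the inner lock first */

    pthread_mutex_lock(&ractor_lock);   /* then take the ractor lock alone */
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}
```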

@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch 3 times, most recently from bf6137d to 5f39d4c Compare July 14, 2025 13:59
@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch 3 times, most recently from f1bbf64 to 9041218 Compare July 18, 2025 14:31
@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch from 9041218 to 7da4951 Compare July 18, 2025 14:42
Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup()

In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire
RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup
if we're a dnt, in thread_sched_wait_running_turn() we acquire
thread_sched_lock(cur_th) after condvar wakeup and then RACTOR_LOCK(cr).
This lock inversion can cause a deadlock with rb_ractor_wakeup_all()
(ex: port.send(obj)), where we acquire RACTOR_LOCK(other_r) and then
thread_sched_lock(other_th).

So the deadlock plays out like this:

nt 1: Ractor.receive
  rb_ractor_sched_wait(), after condvar wakeup in thread_sched_wait_running_turn():
    - thread_sched_lock(cur_th) (condvar) # acquires lock
    - rb_ractor_lock_self(cr)             # deadlock here: tries to acquire, HANGS

nt 2: port.send
  ractor_wakeup_all():
    - RACTOR_LOCK(port_r)                 # acquires lock
    - thread_sched_lock                   # tries to acquire, HANGS

To fix it, we now unlock the thread_sched_lock before acquiring the
ractor_lock in rb_ractor_sched_wait().

Script that reproduces issue:

```ruby
require "async"
class RactorWrapper
  def initialize
    @ractor = Ractor.new do
      Ractor.recv # Ractor doesn't start until explicitly told to
      # Do some calculations
      fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
      fib.call(20)
    end
  end

  def take_async
    @ractor.send(nil)
    Thread.new { @ractor.value }.value
  end
end

Async do |task|
  10_000.times do |i|
    task.async do
      RactorWrapper.new.take_async
      puts i
    end
  end
end
exit 0
```

Fixes [Bug #21398]

Co-authored-by: John Hawthorn <john.hawthorn@shopify.com>
@luke-gruber luke-gruber force-pushed the bug_21398_ractor_lock_ordering_issue branch from 7da4951 to b9a51b2 Compare July 18, 2025 16:28