Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup() #13682
base: master
Conversation
Force-pushed from fb0f525 to 9de6c4a
Force-pushed from 74599e3 to 4511622
Here's another reproduction of the deadlock that I can confirm this fixes: https://bugs.ruby-lang.org/issues/20346

N = 1000
ractors = N.times.map do
  Ractor.new do
    Ractor.recv # Ractor doesn't start until explicitly told to
    # Do some calculations
    fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
    fib.call(20)
  end
end
threads = ractors.map { |r| Thread.new { r.value } }
ractors.each { |r| r.send(nil) }
threads.each_with_index do |th, i|
  p(i => th.value)
end

On master this often deadlocks with as little as N=10.
ractor_sync.c (Outdated)

@@ -983,7 +983,16 @@ ractor_wakeup_all(rb_ractor_t *r, enum ractor_wakeup_status wakeup_status)
    VM_ASSERT(waiter->wakeup_status == wakeup_none);

    waiter->wakeup_status = wakeup_status;
    rb_ractor_sched_wakeup(r, waiter->th);
#ifdef RUBY_THREAD_PTHREAD_H
How about this?

struct ractor_waiter *waiter;

do {
    RACTOR_LOCK(r);
    {
        waiter = ccan_list_pop(&r->sync.waiters, struct ractor_waiter, node);
        if (waiter) waiter->wakeup_status = wakeup_status;
    }
    RACTOR_UNLOCK(r);

    /* wake the popped waiter outside of the ractor lock */
    if (waiter) rb_ractor_sched_wakeup(r, waiter->th);
} while (waiter);

This patch is essentially the same as yours, but simpler.
Now I'm not sure
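For readers following along, here is a minimal standalone sketch of the pattern this suggestion describes, written with plain pthreads rather than the real ractor_sync.c types (struct waiter, struct waitq, and wakeup_all below are illustrative stand-ins, not CRuby code): pop one waiter while holding the queue lock, release the lock, wake that waiter, and repeat until the queue is empty, so no other lock is ever acquired while the queue lock is held.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins: the real code uses struct ractor_waiter,
 * r->sync.waiters and RACTOR_LOCK/RACTOR_UNLOCK. */
struct waiter {
    struct waiter *next;
    int wakeup_status;       /* 0 plays the role of wakeup_none */
    pthread_cond_t cond;     /* each waiter sleeps on its own condvar */
};

struct waitq {
    pthread_mutex_t lock;    /* plays the role of the ractor lock */
    struct waiter *head;     /* linked list of blocked waiters */
};

/* Wake every waiter: pop one per lock acquisition, then signal it
 * after the lock has been released. */
static void
wakeup_all(struct waitq *q, int wakeup_status)
{
    struct waiter *w;

    do {
        pthread_mutex_lock(&q->lock);
        w = q->head;
        if (w) {
            q->head = w->next;                 /* pop the first waiter */
            w->wakeup_status = wakeup_status;  /* record why it was woken */
        }
        pthread_mutex_unlock(&q->lock);

        if (w) pthread_cond_signal(&w->cond);  /* wake outside the lock */
    } while (w);
}
```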
ractor_sync.c (Outdated)

@@ -1018,10 +1028,18 @@ ubf_ractor_wait(void *ptr)
    waiter->wakeup_status = wakeup_by_interrupt;
    ccan_list_del(&waiter->node);

#ifdef RUBY_THREAD_PTHREAD_H
    should_wake = true;
#else
We can use should_wake for non-pthread builds too, and remove the #ifdef macro.
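As a rough standalone sketch of that shape (plain pthreads, illustrative names only, not the real ubf_ractor_wait): the decision is recorded in should_wake while the lock is held, and the actual wakeup happens after the unlock on every platform, so no RUBY_THREAD_PTHREAD_H branch is needed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-in for one blocked waiter; the real code uses
 * struct ractor_waiter protected by the ractor lock. */
struct waiter {
    pthread_mutex_t *owner_lock;  /* lock guarding wakeup_status and the list */
    pthread_cond_t cond;
    int wakeup_status;            /* 0 = still waiting, 1 = interrupted */
};

/* Unblock one waiter: mutate its state under the lock, wake it afterwards. */
static void
unblock_waiter(struct waiter *w)
{
    bool should_wake = false;

    pthread_mutex_lock(w->owner_lock);
    if (w->wakeup_status == 0) {      /* analogous to wakeup_none */
        w->wakeup_status = 1;         /* analogous to wakeup_by_interrupt */
        should_wake = true;           /* same path for pthread and non-pthread */
    }
    pthread_mutex_unlock(w->owner_lock);

    if (should_wake) pthread_cond_signal(&w->cond);
}
```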
Force-pushed from e1db15a to c1a1ec1
So I made the changes, but there are Windows failures in CI. I think the ractor lock needs to be held during broadcasting because of the Windows implementation.

Edit: I added another commit on top to try to fix the Windows failures, but I'm still getting some. I'm going to move on to other things for now, but I would appreciate it if you could take another look, or if you have any ideas, please share 🙇
Force-pushed from 3aa4a77 to 1e5319e
We were getting Windows timeout failures in CI with the approach of unlocking the ractor lock before wakeup. It's very hard to debug Windows issues, especially because I don't have a computer with Windows installed (I tried for a couple of days), so I changed the waiter side instead. Everything seems to be working well. @ko1 Let me know if this is an acceptable solution for the time being. We plan on adding more assertions for lock ordering to prevent this kind of issue in the future, but that will be done in a separate PR.
Force-pushed from bf6137d to 5f39d4c
Force-pushed from f1bbf64 to 9041218
Force-pushed from 9041218 to 7da4951
Fix lock ordering issue for rb_ractor_sched_wait() and rb_ractor_sched_wakeup()

In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT, in thread_sched_wait_running_turn() we acquire thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can cause a deadlock with rb_ractor_wakeup_all() (ex: port.send(obj)), where we acquire RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).

So the error happens:

nt 1: Ractor.receive
  rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
    - thread_sched_lock(cur_th) (condvar) # acquires lock
    - rb_ractor_lock_self(cr)             # deadlock here: tries to acquire, HANGS

nt 2: port.send
  ractor_wakeup_all()
    - RACTOR_LOCK(port_r) # acquires lock
    - thread_sched_lock   # tries to acquire, HANGS

To fix it, we now unlock the thread_sched_lock before acquiring the ractor_lock in rb_ractor_sched_wait().

Script that reproduces the issue:

```ruby
require "async"

class RactorWrapper
  def initialize
    @ractor = Ractor.new do
      Ractor.recv # Ractor doesn't start until explicitly told to
      # Do some calculations
      fib = ->(x) { x < 2 ? 1 : fib.call(x - 1) + fib.call(x - 2) }
      fib.call(20)
    end
  end

  def take_async
    @ractor.send(nil)
    Thread.new { @ractor.value }.value
  end
end

Async do |task|
  10_000.times do |i|
    task.async do
      RactorWrapper.new.take_async
      puts i
    end
  end
end
exit 0
```

Fixes [Bug #21398]

Co-authored-by: John Hawthorn <john.hawthorn@shopify.com>
Force-pushed from 7da4951 to b9a51b2
In rb_ractor_sched_wait() (ex: Ractor.receive), we acquire RACTOR_LOCK(cr) and then thread_sched_lock(cur_th). However, on wakeup, if we're a DNT, in thread_sched_wait_running_turn() we acquire thread_sched_lock(cur_th) after the condvar wakeup and then RACTOR_LOCK(cr). This lock inversion can cause a deadlock with rb_ractor_wakeup_all() (ex: port.send(obj)), where we acquire RACTOR_LOCK(other_r) and then thread_sched_lock(other_th).
So, the error happens:
nt 1: Ractor.receive
rb_ractor_sched_wait() after condvar wakeup in thread_sched_wait_running_turn():
- thread_sched_lock(cur_th) (condvar) # acquires lock
- rb_ractor_lock_self(cr) # deadlock here: tries to acquire, HANGS
nt 2: port.send
ractor_wakeup_all()
- RACTOR_LOCK(port_r) # acquires lock
- thread_sched_lock # tries to acquire, HANGS
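The inversion above can be reproduced outside of CRuby with two plain pthread mutexes; the sketch below only illustrates the nt 1 / nt 2 interleaving (ractor_lock and sched_lock stand in for RACTOR_LOCK and thread_sched_lock; nothing here is the actual Ruby source).

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RACTOR_LOCK */
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for thread_sched_lock */

/* nt 1 shape: sched lock first, then the ractor lock. */
static void *
waiter_side(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sched_lock);    /* after the condvar wakeup */
    pthread_mutex_lock(&ractor_lock);   /* hangs if nt 2 already holds ractor_lock */
    pthread_mutex_unlock(&ractor_lock);
    pthread_mutex_unlock(&sched_lock);
    return NULL;
}

/* nt 2 shape: ractor lock first, then the sched lock. */
static void *
sender_side(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ractor_lock);
    pthread_mutex_lock(&sched_lock);    /* hangs if nt 1 already holds sched_lock */
    pthread_mutex_unlock(&sched_lock);
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}

int
main(void)
{
    /* Run both sides repeatedly; with opposite lock orders this can deadlock. */
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        pthread_create(&a, NULL, waiter_side, NULL);
        pthread_create(&b, NULL, sender_side, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
    }
    puts("no deadlock this run");
    return 0;
}
```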
One solution would be to rework thread_sched_wait_running_turn() with DNTs. I didn't do this because it would be a bigger architectural change. What I changed is to unlock RACTOR_LOCK before calling rb_ractor_sched_wakeup() in a pthread env. In a non-pthread env it's safe to hold this lock, and we should.

Fixes [Bug #21398]
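For contrast, here is a sketch of the fixed waiter-side shape described in the commit message, using the same stand-in names as the previous sketch: the sched lock is released before the ractor lock is taken, so neither path holds one of the two locks while waiting for the other. (The pthread-only variant described in this paragraph achieves the same thing from the wakeup side by dropping RACTOR_LOCK before calling rb_ractor_sched_wakeup().)

```c
#include <pthread.h>

static pthread_mutex_t ractor_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RACTOR_LOCK */
static pthread_mutex_t sched_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for thread_sched_lock */

/* Fixed waiter-side shape: never hold sched_lock while acquiring ractor_lock. */
static void *
waiter_side_fixed(void *arg)
{
    (void)arg;

    pthread_mutex_lock(&sched_lock);    /* e.g. after the condvar wakeup */
    /* ... consume scheduler state ... */
    pthread_mutex_unlock(&sched_lock);  /* release before switching lock domains */

    pthread_mutex_lock(&ractor_lock);
    /* ... ractor-side bookkeeping ... */
    pthread_mutex_unlock(&ractor_lock);
    return NULL;
}
```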