
Mark Fibers in Mutex/Queue/SizedQueue wait lists

Open nevans opened this issue 3 years ago • 2 comments

https://bugs.ruby-lang.org/issues/18818

CC: @ioquatix

nevans avatar Jun 06 '22 18:06 nevans

n.b. no tests.

Probably a test or spec should be added that repeatedly calls GC.start some arbitrary number of times and verifies that fibers waiting on Mutex, Queue, SizedQueue are never collected. e.g. something like https://github.com/ruby/ruby/blob/ec3542229b29ec93062e9d90e877ea29d3c19472/spec/ruby/library/weakref/fixtures/classes.rb#L13-L25
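A minimal sketch of such a spec might look like the following (all names are illustrative, not from the PR; note that a plain Thread is always retained by the VM's thread list, so this passes even without the fix — the interesting case is fibers parked by a scheduler, for which the shape would be the same):

```ruby
require "weakref"

# Park a waiter on a Queue, keep only a WeakRef to it, run GC repeatedly,
# and check that the waiter is never collected.
queue = Queue.new
waiter = Thread.new { queue.pop }
Thread.pass until waiter.status == "sleep" # spin until the thread is parked

ref = WeakRef.new(waiter)
waiter = nil # drop our only strong reference to the thread

10.times { GC.start }
raise "waiter was collected" unless ref.weakref_alive?

thread = ref.__getobj__ # take a strong reference back
queue << :done          # wake the waiter so the script can exit cleanly
p thread.value          # => :done
```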

nevans avatar Jun 06 '22 22:06 nevans

Thanks for this. I think we need some feedback from @ko1 since he created the original implementation. I did experiment with this implementation while working on autoload code, but wasn't sure if it was appropriate and ultimately fixed it by having the scheduler retain the fiber, which is reasonable but shouldn't be strictly necessary, IMHO.

ioquatix avatar Jun 09 '22 01:06 ioquatix

@nevans do you mind rebasing this on master?

ioquatix avatar Jun 08 '23 02:06 ioquatix

@ioquatix Sure, I've been meaning to do that anyway but other things keep interrupting.

nevans avatar Jun 08 '23 15:06 nevans

@ioquatix I've rebased it. There were no conflicts identified by git, but I haven't (yet) looked at it myself to see why it doesn't work with the latest master. I may not get to it today, but I'll tag you again when I have done that.

nevans avatar Jun 08 '23 17:06 nevans

I'm sure it was working when I first made the PR! But I think that the last time I rebased it, I broke some logic relating to the fork_gen code. It's probably a quick fix so I'll attempt to debug and fix it at the end of my workday today.

nevans avatar Jun 09 '23 14:06 nevans

I fixed a couple of bugs relating to fork_gen and cur->fiber == NULL. It's passing make test-all for me locally now... we'll see about CI. 🙂

nevans avatar Jun 17 '23 13:06 nevans

So, I talked to @ko1 about this. In summary:

  • Marking all the fibers/threads/waiters can be slow O(N) when there are thousands of them.
  • It's preferred to only do the bookkeeping when scheduling a waiter: we know a thread is blocked until it is done waiting, and it can't be garbage collected until it has finished, so there is no chance for it to go out of scope.
  • However, I see value in this work, either as it is (performance impact) OR as an extra pass over objects: after the initial mark, we could have an optional debug step where we validate that all these objects are marked.

In more detail: We know that certain objects should be marked by the implied program flow, but it's (as we've found out) not always true. Adding a debug step "validate the objects were all marked" could have found this issue earlier.

If @ko1 thinks the performance cost is too high (which I'm inclined to agree with but don't have a super strong opinion), maybe we can convert this PR into a supplementary "validate these are all marked" after the initial mark is completed, and only enable this if the vm check mode is set (so only debug builds).

@ko1 what do you think?

ioquatix avatar Jun 20 '23 22:06 ioquatix

First, thanks to both @ioquatix and @ko1 for your time and consideration! I'm not trying to take up your time: you do not need to answer any of my questions (below) or educate me where my assumptions are wrong (although I do appreciate it). I trust your judgement!

And, if putting this behind a config flag of some sort allows the PR to be accepted, just point me at the flag I should use, and I'll do it. 🙂

  • Marking all the fibers/threads/waiters can be slow O(N) when there are thousands of them.

In addition to being O(n), linked lists are much slower to traverse than simply iterating through an array. And, because each node lives on a different fiber's stack, it won't play as well with CPU caches as an array on the heap. It would also be bad if this somehow triggered frequent and unnecessary traversal of waitqs during minor GCs.

I've been making some assumptions that led me to believe that having N blocked threads or fibers is already O(n) and the cost of this is very small (as a percentage of total GC time). But I won't be surprised if you tell me I'm wrong and it has a significant performance penalty!

My performance assumptions:

  • I assume that blocked fibers ultimately do need to be marked, and that marking all of them is already O(n), at least for a major GC pass.
  • I assumed that running a fiber scheduler with thousands of blocked fibers must already store those fibers in an array somewhere (otherwise it would cause a SEGV), and that array already takes O(n) to mark.
  • I assumed that the overhead of running thousands of threads is already so high that traversing a 1000 member linked list once per major GC is imperceptibly small, at least by comparison.
  • There's an upper bound on the total number of waitq nodes in the entire process: the total number of fibers. At least until we have Thread.select or Fiber.select and a fiber can wait on more than one mutex/queue at a time (Go's runtime does some interesting things with its sudog structs that, I assume, primarily serve to minimize this cost).
  • I assume that the overhead of marking a fiber or thread is negligible compared to the overhead of marking all the objects referenced by that fiber or thread. I assume that a mark function will be called for an object at most once per GC run, so the overhead of encountering an object twice during the same run is small. So wouldn't the overhead of marking the fibers twice (once from some hypothetical array of all blocked fibers, and once from the waitq linked lists) also be small?
  • I assume that a sync object with a large number of blocked fibers will most likely be in an older generation than most of its blocked fibers. And it's even more likely to be in an older generation than other objects with references to its blocked fibers.
  • And, if the sync object is in a younger generation than the fibers and other objects with references to the fiber, I assumed that it most likely doesn't have very many blocked fibers.
  • I'm assuming that older generation objects are marked less frequently, and (combined with my previous assumptions) that fibers are less likely to be marked through the sync objects than through other paths.
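As a toy illustration of the scheduler assumption above (the names and structure here are made up, not the real Fiber::Scheduler API): any working scheduler has to hold strong references to its blocked fibers somewhere, and that container is already an O(n) mark root.

```ruby
# Toy model: "blocking" is stood in for by Fiber.yield, and the
# scheduler's array is what keeps waiters alive across GC.
blocked = []

fiber = Fiber.new do
  Fiber.yield   # stand-in for blocking on a Mutex/Queue
  :woken
end

fiber.resume      # run until it "blocks"
blocked << fiber  # the scheduler retains the waiter...
GC.start          # ...so GC marks it via this array, O(n) over waiters

result = blocked.pop.resume
p result # => :woken
```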

Separately from performance, I like the semantics of marking fibers from sync objects. Haskell's GC has a nice feature where it can detect one form of deadlock: Haskell's scheduler intentionally does not store strong references to "fibers" that are blocked without a timeout. That way, if a sync object with waiters is about to be GCed, that means that no "fiber" is capable of waking up any of the waiting "fibers", so instead it wakes up all of the waiting "fibers" with a deadlock exception. I know that Ruby's GC doesn't have the facilities for this feature (yet?), but marking the fibers from the sync objects would be necessary for it.
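For illustration, here is that scenario in Ruby terms (a sketch, not a proposal): a waiter blocked on a Queue that nothing else references. GHC would wake such a waiter with a `BlockedIndefinitelyOnMVar` exception; Ruby's VM only detects the global case where every thread is asleep, so this waiter just sleeps forever.

```ruby
# The Queue is reachable only from the waiter itself, so no other
# thread or fiber can ever push to it.
waiter = Thread.new { Queue.new.pop }
Thread.pass until waiter.status == "sleep" # spin until it is parked

status = waiter.status
p status # => "sleep"; blocked indefinitely, with no per-object detection
waiter.kill.join
```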

Again, thanks for your time and consideration!

nevans avatar Jun 20 '23 23:06 nevans

I assumed that running a fiber scheduler with thousands of blocked fibers must already store those fibers in an array somewhere (otherwise it would cause a SEGV), and that array already takes O(n) to mark.

This I hope we can fix more efficiently.

Regarding all the other points, it makes sense to me.

ioquatix avatar Jun 22 '23 03:06 ioquatix