Make `Caml_state` `NULL` while the domain lock is not held by gadmm · Pull Request #11272 · ocaml/ocaml

gadmm · 2022-05-22T22:43:16Z

Here is an alternative 2-line solution, with added benefits, to the problem of checking if the runtime lock is held. See previous discussion at #11138. (cc who took part in previous discussion: @sadiqj, @xavierleroy.)

We set Caml_state to NULL inside of blocking sections, with the added benefit that unprotected access to Caml_state (such as by running some runtime functions without holding the domain lock) is easier to debug.

This PR lives on top of #11271, only the two topmost commits belong to this PR.

From the commit-log:

Caml_state is NULL while the domain lock is not held

The domain lock protects access to the domain state. By setting
Caml_state to NULL inside blocking sections, one increases the
chances that an incorrect manipulation of the runtime causes an
early failure, as well as making such bugs easier to debug.

(Drawback: some unprotected accesses to Caml_state are safe, such as
accessing Caml_state->id inside blocking sections. Little is lost
because the programmer who needs it can store the domain ID in a
thread-local variable.)
In addition, this change makes it possible to test whether a thread
belonging to a domain holds the lock of its domain. This is
necessary for implementing C functions which are generic on whether
the domain lock is held or not. By testing whether Caml_state is
NULL, the programmer can for instance:
- report a programming error
- try to acquire the domain lock
- do something else (e.g. delay an operation until the lock is
  acquired)
One application is for the OCaml-Rust FFI: in some situations, it is
not possible to enforce with static typing that some function runs
only when the runtime lock is held (notably destructors). For these
situations, this function lets us implement:
- a dynamic check of whether the lock is held (which plays well with
  the lifetime-based GC safety),
- a way to acquire the domain lock only if it is not already held.
In OCaml 4, it is possible to implement this feature on the
programmer-side (using internals) by customizing the enter/leave
runtime hooks. Alas, this strategy does not work in OCaml 5. One
would have to somehow install hooks after systhread is loaded, but
before the second domain is spawned (after which hooks can no longer
be modified), and this is very hard to do. So we need a native
solution for OCaml 5.

gasche · 2022-05-23T06:47:56Z

In principle I like the idea, I think it's a nice approach and in particular I like that it indeed improves debuggability.

Caml_state lookups are performance-critical. Is there a risk of deoptimization due to the fact that Caml_state was previously a very-rarely-mutated variable (only on domain startup) and is now modified often? (I would guess not, because we currently use a mix of C and assembly code that compilers can't reason about anyway.)

gadmm · 2022-05-23T14:00:01Z

I do not think so. Even assuming the compiler does some kind of interprocedural analysis, it would not know what could happen during enter/leave_blocking_section & co., given that they call hooks. Also there is a disproportion of scales in your concern, accessing Caml_state is very cheap (even with the current MacOS implementation) at the scale at which how often enter/leave_blocking_section are called.

gasche · 2022-05-23T14:23:26Z

The scenario I had in mind (which, upon reflection, I agree probably does not exist) is a scenario where a block of code uses Caml_state a lot, and the compiler is able to prove that the value never gets mutated, and now with the code it is not able to prove this anymore because some infrequent code paths (in this block of code) temporarily releases the runtime lock. The problem would not be the cost of taking/releasing the lock, but the pessimization of assumptions that it introduces.

gasche · 2022-05-23T14:24:36Z

I like this proposal better than #11138, and I feel "approving" about it, but:

I would like someone who is actually a runtime expert to have a second look (@kayceesrk. @xavierleroy?)
merging this needs to wait for Simplifications and fixes to multicore systhreads implementation #11271

gadmm · 2022-05-23T14:56:20Z

The problem would not be the cost of taking/releasing the lock, but the pessimization of assumptions that it introduces.

What I meant is that the compiler's assumptions are already as pessimistic as they can be.

otherlibs/systhreads/st_stubs.c

kayceesrk · 2022-05-24T07:56:09Z

I left a number of comments. Overall the changes look ok to me. multicore_systhreads.ml test crashes now and needs to be fixed before we can consider merging the changes.

Generally, I prefer having separate PRs if multiple commits are unrelated. It is easier to review and merge the changes this way, and not have them be blocked on more controversial changes. I find reviewing this style of PR with multiple independent changes much harder than it should be.

gadmm · 2022-05-30T14:11:21Z

Added a missing test file.
@kayceesrk's comments will be addressed at Simplifications and fixes to multicore systhreads implementation #11271 later on as said there (including the failing test).
Another blocker for this PR would be the part 3/N of the async action PR series, which has a commit that removes unconditional accesses to Caml_state from signal handlers. This fix is needed for multicore anyway.

gadmm · 2022-07-01T23:15:55Z

Rebased, now that #11271 has been merged. Now one more clearly sees that this is a 2-line change :) I left the Changes in the 5.0 section since people seem to agree to target 5.0.

Keep in mind that this still depends on #11307 (making sure not to access Caml_state from signal handlers at least unconditionally).

gadmm · 2022-07-07T17:28:24Z

I added a commit “Do not access Caml_state without holding the lock in systhreads” here so that this PR is now independent from all the other PRs on systhreads. (cc @Engil) It is a considerably simplified implementation compared to the last commit of #11271, that does not try to reorganise the systhreads implementation (I leave this to other PRs).

@gasche Would you please run precheck on this?

gadmm · 2022-07-07T17:30:53Z

I feel guilty of asking @gasche too much, can I please have the rights to run prechecks ?

gasche · 2022-07-07T18:56:07Z

https://ci.inria.fr/ocaml/job/precheck/737/

gadmm · 2022-07-08T18:06:00Z

The timeout in the freebsd testsuite appears unrelated, so I think this passes at last.

gasche · 2022-07-12T11:47:35Z

I have a naive question on the code in this PR: which part of st_stubs.c actually runs without the runtime lock? (Is this documented?)

From the before-the-PR-code, here is what I would assume:

functions that are careful to call caml_init_domain_self (that is, caml_thread_start and caml_c_thread_register) may run without holding the runtime lock
otherwise, nothing else, I would assume?

but if this analysis is correct, then why are there so many changes to the threads code in this PR?

gadmm · 2022-07-12T12:07:26Z

Thanks for looking at it. Meaningful changes:

caml_thread_initialize_domain now initializes domain_id: this ensures the first thread of a domain has correct domain_id info (was not necessary before).
caml_thread_leave_blocking_section, caml_thread_start: get the domain_id from the thread info instead of Caml_state->id.
caml_c_thread_register, caml_c_thread_unregister: similarly do not access Caml_state before having the lock (this was actually not necessary).

To your analysis you only need to add caml_thread_leave_blocking_section and (counterintuitively) caml_c_thread_unregister which run without the lock.

Anything else is simple refactoring to make st_stubs.c nicer after the change.

gasche

I did a new review pass, and I am convinced that the change is correct. I also understand better why some of the "refactoring changes" are in fact important to be able to reason about the code.

Could you add a comment at the start of each function that may be invoked without holding the runtime lock? It is natural for leave_blocking_section, not-too-hard-to-reverse-engineer-from-the-code for the two functions I found, and mysterious for caml_c_thread_unregister. In all cases it would help to read the code and reason about it.

- The domain lock protects access to the domain state. By setting `Caml_state` to `NULL` inside blocking sections, one increases the likelihood that an incorrect manipulation of the runtime fails early, and makes such bugs in general easier to debug. (Drawback: some unprotected accesses to Caml_state are safe, such as accessing Caml_state->id during blocking sections. As a workaround, the programmer who really needs it can store the domain ID somewhere else.) - In addition, this change makes it possible to test whether a thread belonging to a domain holds the lock of its domain. This is necessary for implementing C functions which are generic on whether the domain lock is held or not. By testing whether `Caml_state` is `NULL`, the programmer can for instance: - report a programming error - try to acquire the domain lock - do something else (e.g. delay an operation until the lock is acquired) One application is for the OCaml-Rust FFI: in some situations, it is not possible to enforce with static typing that some function runs only when the runtime lock is held (notably destructors). For these situations, this function lets us implement: - a dynamic check of whether the lock is held (which plays well with the lifetime-based GC safety), - a way to acquire the domain lock only if it is not already held. In OCaml 4, it is possible to implement this feature on the programmer-side (using internals) by customizing the enter/leave runtime hooks. Alas, this strategy does not work in OCaml 5. One would have to somehow install hooks after systhread is loaded, but before the second domain is spawned (after which hooks can no longer be modified), and this is very hard to do. So we need a native solution for OCaml 5.

gasche · 2022-07-16T18:44:25Z

I merged this in trunk. I would like to get this in 5.0, but I'm waiting a bit in case (despite the precheck test) we find out that all hell breaks loose. I marked this "consider-for-backport" so that we don't forget to backport to 5.0.

gadmm · 2022-07-16T21:05:16Z

Thanks. Indeed results from opam health check would be nice, but I just saw that it is very broken already.

damiendoligez · 2022-07-18T14:27:08Z

@gadmm

I feel guilty of asking @gasche too much, can I please have the rights to run prechecks ?

Please create an account on ci.inria.fr and click Join an existing project.

gadmm · 2022-07-18T16:18:26Z

@damiendoligez I already have such an account. The issue is that I am supposed to see "Build with parameters" but this does not appear.

gadmm · 2022-07-21T16:56:13Z

@gasche I wonder if it is a good idea to further delay merging into 5.0. Delaying makes it more difficult for the systhreads PR it overlaps with, as it would be more work (and more risk) to merge the PRs in a different order in different branches. In particular I am waiting on this to rebase another PR. Could we have it in 5.0?

You did not say when we would know that it is good to backport, and there is no good reason to have it in 5.1 and not 5.0 (the PR is either for 5.0 or not at all). So this PR is in a weird state where it could be in and at the same time we must do as if it could be reverted anytime. In addition, if it actually does affect some user corner-cases, then this is better revealed by users being able to test it.

gasche · 2022-07-21T18:26:21Z

Thanks for the ping, I'm happy to backport now (or: if I forget, at the next ping).

Make `Caml_state` `NULL` while the domain lock is not held (cherry picked from commit bc36f00)

gasche · 2022-07-21T20:17:32Z

Backported in 5.0 as 4a58209.

gadmm · 2022-07-21T20:42:40Z

Thanks everyone!

damiendoligez · 2022-09-06T15:24:06Z

@damiendoligez I already have such an account. The issue is that I am supposed to see "Build with parameters" but this does not appear.

@gadmm I just gave you more rights on CI, please tell me if that's enough to let you build with parameters.

gadmm · 2022-09-06T16:35:47Z

Yes–this worked. Thanks!

This was referenced May 22, 2022

Add C function caml_thread_has_lock #11138

Closed

Simplifications and fixes to multicore systhreads implementation #11271

Merged