Make Caml_state NULL while the domain lock is not held#11272
Make Caml_state NULL while the domain lock is not held#11272gasche merged 4 commits intoocaml:trunkfrom
Caml_state NULL while the domain lock is not held#11272Conversation
|
In principle I like the idea, I think it's a nice approach and in particular I like that it indeed improves debuggability. Caml_state lookups are performance-critical. Is there a risk of deoptimization due to the fact that Caml_state was previously a very-rarely-mutated variable (only on domain startup) and is now modified often? (I would guess not, because we currently use a mix of C and assembly code that compilers can't reason about anyway.) |
|
I do not think so. Even assuming the compiler does some kind of interprocedural analysis, it would not know what could happen during enter/leave_blocking_section & co., given that they call hooks. Also there is a disproportion of scales in your concern, accessing Caml_state is very cheap (even with the current MacOS implementation) at the scale at which how often enter/leave_blocking_section are called. |
|
The scenario I had in mind (which, upon reflection, I agree probably does not exist) is a scenario where a block of code uses Caml_state a lot, and the compiler is able to prove that the value never gets mutated, and now with the code it is not able to prove this anymore because some infrequent code paths (in this block of code) temporarily releases the runtime lock. The problem would not be the cost of taking/releasing the lock, but the pessimization of assumptions that it introduces. |
|
I like this proposal better than #11138, and I feel "approving" about it, but:
|
What I meant is that the compiler's assumptions are already as pessimistic as they can be. |
|
I left a number of comments. Overall the changes look ok to me. Generally, I prefer having separate PRs if multiple commits are unrelated. It is easier to review and merge the changes this way, and not have them be blocked on more controversial changes. I find reviewing this style of PR with multiple independent changes much harder than it should be. |
|
e109038 to
168cf9f
Compare
168cf9f to
2b9b8d0
Compare
|
Rebased, now that #11271 has been merged. Now one more clearly sees that this is a 2-line change :) I left the Keep in mind that this still depends on #11307 (making sure not to access |
2b9b8d0 to
781ba1b
Compare
|
I added a commit “Do not access Caml_state without holding the lock in systhreads” here so that this PR is now independent from all the other PRs on systhreads. (cc @Engil) It is a considerably simplified implementation compared to the last commit of #11271, that does not try to reorganise the systhreads implementation (I leave this to other PRs). @gasche Would you please run precheck on this? |
|
I feel guilty of asking @gasche too much, can I please have the rights to run prechecks ? |
|
The timeout in the freebsd testsuite appears unrelated, so I think this passes at last. |
|
I have a naive question on the code in this PR: which part of From the before-the-PR-code, here is what I would assume:
but if this analysis is correct, then why are there so many changes to the threads code in this PR? |
|
Thanks for looking at it. Meaningful changes:
To your analysis you only need to add Anything else is simple refactoring to make st_stubs.c nicer after the change. |
gasche
left a comment
There was a problem hiding this comment.
I did a new review pass, and I am convinced that the change is correct. I also understand better why some of the "refactoring changes" are in fact important to be able to reason about the code.
Could you add a comment at the start of each function that may be invoked without holding the runtime lock? It is natural for leave_blocking_section, not-too-hard-to-reverse-engineer-from-the-code for the two functions I found, and mysterious for caml_c_thread_unregister. In all cases it would help to read the code and reason about it.
- The domain lock protects access to the domain state. By setting
`Caml_state` to `NULL` inside blocking sections, one increases the
likelihood that an incorrect manipulation of the runtime fails
early, and makes such bugs in general easier to debug.
(Drawback: some unprotected accesses to Caml_state are safe, such as
accessing Caml_state->id during blocking sections. As a workaround,
the programmer who really needs it can store the domain ID somewhere
else.)
- In addition, this change makes it possible to test whether a thread
belonging to a domain holds the lock of its domain. This is
necessary for implementing C functions which are generic on whether
the domain lock is held or not. By testing whether `Caml_state` is
`NULL`, the programmer can for instance:
- report a programming error
- try to acquire the domain lock
- do something else (e.g. delay an operation until the lock is
acquired)
One application is for the OCaml-Rust FFI: in some situations, it is
not possible to enforce with static typing that some function runs
only when the runtime lock is held (notably destructors). For these
situations, this function lets us implement:
- a dynamic check of whether the lock is held (which plays well with
the lifetime-based GC safety),
- a way to acquire the domain lock only if it is not already held.
In OCaml 4, it is possible to implement this feature on the
programmer-side (using internals) by customizing the enter/leave
runtime hooks. Alas, this strategy does not work in OCaml 5. One
would have to somehow install hooks after systhread is loaded, but
before the second domain is spawned (after which hooks can no longer
be modified), and this is very hard to do. So we need a native
solution for OCaml 5.
e2a3b59 to
f65ea31
Compare
|
I merged this in trunk. I would like to get this in 5.0, but I'm waiting a bit in case (despite the precheck test) we find out that all hell breaks loose. I marked this "consider-for-backport" so that we don't forget to backport to 5.0. |
|
Thanks. Indeed results from opam health check would be nice, but I just saw that it is very broken already. |
|
@damiendoligez I already have such an account. The issue is that I am supposed to see "Build with parameters" but this does not appear. |
|
@gasche I wonder if it is a good idea to further delay merging into 5.0. Delaying makes it more difficult for the systhreads PR it overlaps with, as it would be more work (and more risk) to merge the PRs in a different order in different branches. In particular I am waiting on this to rebase another PR. Could we have it in 5.0? You did not say when we would know that it is good to backport, and there is no good reason to have it in 5.1 and not 5.0 (the PR is either for 5.0 or not at all). So this PR is in a weird state where it could be in and at the same time we must do as if it could be reverted anytime. In addition, if it actually does affect some user corner-cases, then this is better revealed by users being able to test it. |
|
Thanks for the ping, I'm happy to backport now (or: if I forget, at the next ping). |
Make `Caml_state` `NULL` while the domain lock is not held (cherry picked from commit bc36f00)
|
Backported in 5.0 as 4a58209. |
|
Thanks everyone! |
@gadmm I just gave you more rights on CI, please tell me if that's enough to let you build with parameters. |
|
Yes–this worked. Thanks! |
Here is an alternative 2-line solution, with added benefits, to the problem of checking if the runtime lock is held. See previous discussion at #11138. (cc who took part in previous discussion: @sadiqj, @xavierleroy.)
We set
Caml_statetoNULLinside of blocking sections, with the added benefit that unprotected access toCaml_state(such as by running some runtime functions without holding the domain lock) is easier to debug.This PR lives on top of #11271, only the two topmost commits belong to this PR.
From the commit-log:
Caml_state is NULL while the domain lock is not held
The domain lock protects access to the domain state. By setting
Caml_state to NULL inside blocking sections, one increases the
chances that an incorrect manipulation of the runtime causes an
early failure, as well as making such bugs easier to debug.
(Drawback: some unprotected accesses to Caml_state are safe, such as
accessing Caml_state->id inside blocking sections. Little is lost
because the programmer who needs it can store the domain ID in a
thread-local variable.)
In addition, this change makes it possible to test whether a thread
belonging to a domain holds the lock of its domain. This is
necessary for implementing C functions which are generic on whether
the domain lock is held or not. By testing whether Caml_state is
NULL, the programmer can for instance:
acquired)
One application is for the OCaml-Rust FFI: in some situations, it is
not possible to enforce with static typing that some function runs
only when the runtime lock is held (notably destructors). For these
situations, this function lets us implement:
the lifetime-based GC safety),
In OCaml 4, it is possible to implement this feature on the
programmer-side (using internals) by customizing the enter/leave
runtime hooks. Alas, this strategy does not work in OCaml 5. One
would have to somehow install hooks after systhread is loaded, but
before the second domain is spawned (after which hooks can no longer
be modified), and this is very hard to do. So we need a native
solution for OCaml 5.