Skip to content

Make Caml_state NULL while the domain lock is not held#11272

Merged
gasche merged 4 commits intoocaml:trunkfrom
gadmm:caml_try_get_caml_state3
Jul 16, 2022
Merged

Make Caml_state NULL while the domain lock is not held#11272
gasche merged 4 commits intoocaml:trunkfrom
gadmm:caml_try_get_caml_state3

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented May 22, 2022

Here is an alternative 2-line solution, with added benefits, to the problem of checking if the runtime lock is held. See previous discussion at #11138. (cc who took part in previous discussion: @sadiqj, @xavierleroy.)

We set Caml_state to NULL inside of blocking sections, with the added benefit that unprotected access to Caml_state (such as by running some runtime functions without holding the domain lock) is easier to debug.

This PR lives on top of #11271, only the two topmost commits belong to this PR.

From the commit-log:


Caml_state is NULL while the domain lock is not held

  • The domain lock protects access to the domain state. By setting
    Caml_state to NULL inside blocking sections, one increases the
    chances that an incorrect manipulation of the runtime causes an
    early failure, as well as making such bugs easier to debug.

    (Drawback: some unprotected accesses to Caml_state are safe, such as
    accessing Caml_state->id inside blocking sections. Little is lost
    because the programmer who needs it can store the domain ID in a
    thread-local variable.)

  • In addition, this change makes it possible to test whether a thread
    belonging to a domain holds the lock of its domain. This is
    necessary for implementing C functions which are generic on whether
    the domain lock is held or not. By testing whether Caml_state is
    NULL, the programmer can for instance:

    • report a programming error
    • try to acquire the domain lock
    • do something else (e.g. delay an operation until the lock is
      acquired)

    One application is for the OCaml-Rust FFI: in some situations, it is
    not possible to enforce with static typing that some function runs
    only when the runtime lock is held (notably destructors). For these
    situations, this function lets us implement:

    • a dynamic check of whether the lock is held (which plays well with
      the lifetime-based GC safety),
    • a way to acquire the domain lock only if it is not already held.

    In OCaml 4, it is possible to implement this feature on the
    programmer-side (using internals) by customizing the enter/leave
    runtime hooks. Alas, this strategy does not work in OCaml 5. One
    would have to somehow install hooks after systhread is loaded, but
    before the second domain is spawned (after which hooks can no longer
    be modified), and this is very hard to do. So we need a native
    solution for OCaml 5.

@gasche
Copy link
Copy Markdown
Member

gasche commented May 23, 2022

In principle I like the idea, I think it's a nice approach and in particular I like that it indeed improves debuggability.

Caml_state lookups are performance-critical. Is there a risk of deoptimization due to the fact that Caml_state was previously a very-rarely-mutated variable (only on domain startup) and is now modified often? (I would guess not, because we currently use a mix of C and assembly code that compilers can't reason about anyway.)

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 23, 2022

I do not think so. Even assuming the compiler does some kind of interprocedural analysis, it would not know what could happen during enter/leave_blocking_section & co., given that they call hooks. Also there is a disproportion of scales in your concern, accessing Caml_state is very cheap (even with the current MacOS implementation) at the scale at which how often enter/leave_blocking_section are called.

@gasche
Copy link
Copy Markdown
Member

gasche commented May 23, 2022

The scenario I had in mind (which, upon reflection, I agree probably does not exist) is a scenario where a block of code uses Caml_state a lot, and the compiler is able to prove that the value never gets mutated, and now with the code it is not able to prove this anymore because some infrequent code paths (in this block of code) temporarily releases the runtime lock. The problem would not be the cost of taking/releasing the lock, but the pessimization of assumptions that it introduces.

@gasche
Copy link
Copy Markdown
Member

gasche commented May 23, 2022

I like this proposal better than #11138, and I feel "approving" about it, but:

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 23, 2022

The problem would not be the cost of taking/releasing the lock, but the pessimization of assumptions that it introduces.

What I meant is that the compiler's assumptions are already as pessimistic as they can be.

@kayceesrk
Copy link
Copy Markdown
Contributor

I left a number of comments. Overall the changes look ok to me. multicore_systhreads.ml test crashes now and needs to be fixed before we can consider merging the changes.

Generally, I prefer having separate PRs if multiple commits are unrelated. It is easier to review and merge the changes this way, and not have them be blocked on more controversial changes. I find reviewing this style of PR with multiple independent changes much harder than it should be.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 30, 2022

  • Added a missing test file.
  • @kayceesrk's comments will be addressed at Simplifications and fixes to multicore systhreads implementation #11271 later on as said there (including the failing test).
  • Another blocker for this PR would be the part 3/N of the async action PR series, which has a commit that removes unconditional accesses to Caml_state from signal handlers. This fix is needed for multicore anyway.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 1, 2022

Rebased, now that #11271 has been merged. Now one more clearly sees that this is a 2-line change :) I left the Changes in the 5.0 section since people seem to agree to target 5.0.

Keep in mind that this still depends on #11307 (making sure not to access Caml_state from signal handlers at least unconditionally).

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 7, 2022

I added a commit “Do not access Caml_state without holding the lock in systhreads” here so that this PR is now independent from all the other PRs on systhreads. (cc @Engil) It is a considerably simplified implementation compared to the last commit of #11271, that does not try to reorganise the systhreads implementation (I leave this to other PRs).

@gasche Would you please run precheck on this?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 7, 2022

I feel guilty of asking @gasche too much, can I please have the rights to run prechecks ?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 7, 2022

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 8, 2022

The timeout in the freebsd testsuite appears unrelated, so I think this passes at last.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 12, 2022

I have a naive question on the code in this PR: which part of st_stubs.c actually runs without the runtime lock? (Is this documented?)

From the before-the-PR-code, here is what I would assume:

  • functions that are careful to call caml_init_domain_self (that is, caml_thread_start and caml_c_thread_register) may run without holding the runtime lock
  • otherwise, nothing else, I would assume?

but if this analysis is correct, then why are there so many changes to the threads code in this PR?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 12, 2022

Thanks for looking at it. Meaningful changes:

  • caml_thread_initialize_domain now initializes domain_id: this ensures the first thread of a domain has correct domain_id info (was not necessary before).
  • caml_thread_leave_blocking_section, caml_thread_start: get the domain_id from the thread info instead of Caml_state->id.
  • caml_c_thread_register, caml_c_thread_unregister: similarly do not access Caml_state before having the lock (this was actually not necessary).

To your analysis you only need to add caml_thread_leave_blocking_section and (counterintuitively) caml_c_thread_unregister which run without the lock.

Anything else is simple refactoring to make st_stubs.c nicer after the change.

Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a new review pass, and I am convinced that the change is correct. I also understand better why some of the "refactoring changes" are in fact important to be able to reason about the code.

Could you add a comment at the start of each function that may be invoked without holding the runtime lock? It is natural for leave_blocking_section, not-too-hard-to-reverse-engineer-from-the-code for the two functions I found, and mysterious for caml_c_thread_unregister. In all cases it would help to read the code and reason about it.

gadmm and others added 3 commits July 16, 2022 14:14
- The domain lock protects access to the domain state. By setting
  `Caml_state` to `NULL` inside blocking sections, one increases the
  likelihood that an incorrect manipulation of the runtime fails
  early, and makes such bugs in general easier to debug.

  (Drawback: some unprotected accesses to Caml_state are safe, such as
  accessing Caml_state->id during blocking sections. As a workaround,
  the programmer who really needs it can store the domain ID somewhere
  else.)

- In addition, this change makes it possible to test whether a thread
  belonging to a domain holds the lock of its domain. This is
  necessary for implementing C functions which are generic on whether
  the domain lock is held or not. By testing whether `Caml_state` is
  `NULL`, the programmer can for instance:
  - report a programming error
  - try to acquire the domain lock
  - do something else (e.g. delay an operation until the lock is
    acquired)

  One application is for the OCaml-Rust FFI: in some situations, it is
  not possible to enforce with static typing that some function runs
  only when the runtime lock is held (notably destructors). For these
  situations, this function lets us implement:
  - a dynamic check of whether the lock is held (which plays well with
    the lifetime-based GC safety),
  - a way to acquire the domain lock only if it is not already held.

  In OCaml 4, it is possible to implement this feature on the
  programmer-side (using internals) by customizing the enter/leave
  runtime hooks. Alas, this strategy does not work in OCaml 5. One
  would have to somehow install hooks after systhread is loaded, but
  before the second domain is spawned (after which hooks can no longer
  be modified), and this is very hard to do. So we need a native
  solution for OCaml 5.
@gadmm gadmm force-pushed the caml_try_get_caml_state3 branch from e2a3b59 to f65ea31 Compare July 16, 2022 12:18
@gasche gasche merged commit bc36f00 into ocaml:trunk Jul 16, 2022
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 16, 2022

I merged this in trunk. I would like to get this in 5.0, but I'm waiting a bit in case (despite the precheck test) we find out that all hell breaks loose. I marked this "consider-for-backport" so that we don't forget to backport to 5.0.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 16, 2022

Thanks. Indeed results from opam health check would be nice, but I just saw that it is very broken already.

@damiendoligez
Copy link
Copy Markdown
Member

@gadmm

I feel guilty of asking @gasche too much, can I please have the rights to run prechecks ?

Please create an account on ci.inria.fr and click Join an existing project.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 18, 2022

@damiendoligez I already have such an account. The issue is that I am supposed to see "Build with parameters" but this does not appear.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 21, 2022

@gasche I wonder if it is a good idea to further delay merging into 5.0. Delaying makes it more difficult for the systhreads PR it overlaps with, as it would be more work (and more risk) to merge the PRs in a different order in different branches. In particular I am waiting on this to rebase another PR. Could we have it in 5.0?

You did not say when we would know that it is good to backport, and there is no good reason to have it in 5.1 and not 5.0 (the PR is either for 5.0 or not at all). So this PR is in a weird state where it could be in and at the same time we must do as if it could be reverted anytime. In addition, if it actually does affect some user corner-cases, then this is better revealed by users being able to test it.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 21, 2022

Thanks for the ping, I'm happy to backport now (or: if I forget, at the next ping).

gasche added a commit that referenced this pull request Jul 21, 2022
Make `Caml_state` `NULL` while the domain lock is not held

(cherry picked from commit bc36f00)
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 21, 2022

Backported in 5.0 as 4a58209.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 21, 2022

Thanks everyone!

@damiendoligez
Copy link
Copy Markdown
Member

@damiendoligez I already have such an account. The issue is that I am supposed to see "Build with parameters" but this does not appear.

@gadmm I just gave you more rights on CI, please tell me if that's enough to let you build with parameters.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Sep 6, 2022

Yes–this worked. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants