Implement quality treatment for asynchronous actions in multicore#11057
Implement quality treatment for asynchronous actions in multicore#11057gadmm wants to merge 6 commits intoocaml:trunkfrom
Conversation
|
It would be nice to run this PR through the benchmark suite. I do not expect performance degradation. However if there's a bug in the way |
runtime/signals.c
Outdated
| caml_interrupt_self(); | ||
| // FIXME: Is it possible that Caml_state is not accessible from the | ||
| // current thread? | ||
| caml_set_action_pending(); |
There was a problem hiding this comment.
This is a correctness concern for the code currently in trunk (see 15th commit "do not access TLS in signal handler"). Let me know if it is justified.
runtime/signals.c
Outdated
| If none, done; otherwise, try again */ | ||
| // FIXME: does this become very slow if a signal is recorded but | ||
| // is masked for everybody in capacity of running signals at this | ||
| // point? |
There was a problem hiding this comment.
This is a correctness concern for the code currently in trunk. We test whether any signal is pending, not just ours. It could be that a signal not for us (masked) is recorded and its processing by the target thread takes time. In this situation running caml_process_pending_signals_exn is very expensive because it is guarded by the same test (in addition it contains system calls). Since running it does not end with all signal handlers being called, it is called again in a loop. Let me know if my concerns are justified. See 16th commit about "Do not wait that other threads are processes their signals" for a proposed fix. (cc @xavierleroy)
4cac4cc to
1613e61
Compare
|
I just had a quick look at the commit messages, definitely not enough to have an opinion, but just enough to ask questions to help me understand. I'll focus on the "Do not access the TLS from signal handlers" commit.
Isn't this very costly if we have many domains? Esp. if most of them / all of them but one block the signal anyway?
That's true, and we should harden signal handling against this case (Caml_state will be NULL). Perhaps the broadcast of the signal to all OCaml domains could be done in this case only?
A recommended pattern to use signals with POSIX threads is to handle signals in one thread only, blocking signals in all other threads. I think it's a good way to use signals with OCaml 5 domains too. In which case the POSIX thread that gets the signal definitely is the most ready to execute the OCaml signal handler, because no other thread will.
We need to check. macOS uses |
With all the fixes from this PR, this causes at most one extra poll per received signal per domain. Having in mind usages like sigalrm and sigint, I consider it negligible.
Right, and thanks to this we can still keep the simplification of the thread initialization.
It sounds like it is possible to access Let me know if it is worth the added complexity. |
|
Thank you for taking a thorough go at this @gadmm . I am reviewing this but there are a lot of changes so it may take a while. Just a meta comment, I personally feel like some of these commit sets could have been broken up in to separate PRs (e.g the memory orderings ones and fixing TLS from signal handlers) and that might have helped encourage more reviewers, as well as kept the discussions on those PRs more focused. |
|
A couple of straightforward commits could be done separately, but other commits depend on them, so given that they are straightforward this would only introduce more delays in reviewing the ones that need scrutiny. It is also possible to consider prefixes of the commit set, but unfortunately there are still a lot of dependencies and the design makes sense as a whole. Yes it's a lot of commits but I managed to at least split it in small meaningful commits, so I think it is not so bad given the scale of the change. (I can split the PR into prefixes as you go in your review if you want.) |
sadiqj
left a comment
There was a problem hiding this comment.
Right.
Apart from small comments, I think I'm happy with ba08348 952a19d 6077892 and ae36b9c . These would probably form a pretty uncontroversial PR.
fe14eb8 and cfbbf6f look fine.
Personally I'm not keen on d597f76 . I often misread atomic_store_rel as atomic_store. I prefer keeping the order explicit, though other people might have different views.
The signals TLS changes 289ee88 and c65e08a need to be moved to a separate PR - there's quite a few considerations here, I worry about deadlocks when bringing up new domains. If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).
5fbd234 also needs some discussion, given it affects assembly on all platforms. Presumably only allocations until caml_process_pending_actions_exn is called go through the slow path?
| caml_domain_state * dom_st = Caml_state; | ||
| /* An interrupt might have been queued in the meanwhile; this | ||
| achieves the proper synchronisation. */ | ||
| atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start); |
There was a problem hiding this comment.
If we're ignoring the return value here, why exchange rather than store?
There was a problem hiding this comment.
It is something I'd like some confirmation from a C memory model expert: even if we ignore the return value, this is stronger than an atomic store and makes sure that the subsequent reads of dom_st are properly synchronized (having in mind a parallel thread that races to interrupt the domain).
There was a problem hiding this comment.
An alternative might be a relaxed atomic store followed by an acquire fence before the subsequent reads, but I'd like a confirmation on this as well.
There was a problem hiding this comment.
Note: I am pretty sure about the exchange version (but it might be overkill).
|
I have opened #11095 in case you want to approve the first commits separately and focus here on the remaining ones. (edit: I initially considered the first 8 commits there, but I had to cut it to the first 4 due to failing tests.) |
Well
Ok, let us focus on commits up to cfbbf6f first.
I would need to have a look, can you point me at the code you have in mind?
Right, this one can deserve a separate PR. Its presence there clarifies the intended meaning of |
|
In the end the PR containing only the first 4 commits has failed as well, so #11095 now contains everything up to cfbbf6f, hoping that this will get a green light from CI. It was premature to open a separate PR though, since this is forced to include some of the commits that are still under discussion here. Let us keep the discussion inside the present PR. |
https://github.com/ocaml/ocaml/blob/trunk/otherlibs%2Fsysthreads%2Fst_stubs.c#L596 we do it here - there's an assumption there that |
|
There is nothing wrong with that, the problem is doing it from a POSIX signal handler. Am I missing something? |
No, you're right that it's an issue in a signal handler because it might allocate. |
|
I was intrigued by your mention of The problem seems to stem exclusively from loading the thread-local variable dynamically ( |
8ea60a2 to
b5db561
Compare
|
@kayceesrk CI complains because of failing test weak-ephe-final/finaliser2.ml. Diff shows: I suspect that the new result is correct and previously it was buggy (assuming one can make assumptions on when finalisers run). Can you please have a look and confirm that the solution is indeed to update the reference file? If so, I think that this PR was meant to fix this test right from the start, but we see it only now due to the off-by-one error revealed in #11095. |
175079e to
cf72573
Compare
|
(Part 1 out of N of this PR has been merged; part 2 out of N is currently proposed for review at #11190.) |
Interrupt all domains when a signal arrives. The point of view of this commit is that signals (SIGINT, SIGALRM...) arrive infrequently-enough that this is affordable. * In mixed C/OCaml applications there is no guarantee that the POSIX signal handler runs in an OCaml thread so Caml_state might not be available. * OSX implementation of Caml_state: POSIX/Apple's doc is unclear whether explicit TLS is safe to read from signal handlers (old implementations do not), but this might be a theoretical concern only. (In particular it is unclear how to give it atomic type.) * While C11 mandates that atomic thread-local variables are async-signal-safe for reading, gcc does not conform and can allocate in corner cases involving dynamic linking. * Do not make the hypothesis that the thread executing a POSIX signal handler is the most ready to execute the corresponding OCaml signal handler. (This is likely unnecessary for idiomatic usage of signals, but acting differently depending on whether we can access the current domain is more work given the situation around async-signal-safety of TLS, e.g. one might be able to access domain_self but then the change to make it atomic is invasive.) * These changes allow simplifications to the spawning of threads and domains since we no longer need to mask signals during thread creation. (See subsequent commit.) * The biggest risk is to cause repeated polling in C code that would repeatedly poll without clearing pending actions. A subsequent commit is proposed to fix this behaviour and make sure that a signal notified to a domain cause the minimal amount of polling.
As follows from the preceding commit.
…caml_enter_blocking_section]
- Do not slow down allocations when an asynchronous callback cannot be handled immediately. - Do not wait that other threads have processed signals not for us in [caml_enter_blocking_section]. (Replaces the implementation from the previous commit.) Before this commit, if a pending action cannot be processed immediately then all the allocations go through the slow path until it is processed. It is also possible that we are looping at the start of blocking sections because a signal is pending but not for us. This commit introduces a simpler design whereby [action_pending] alone is used to remember that an asynchronous callback should be run, and [young_limit] is only used to interrupt running code. The key is to modify going from C to OCaml code so that [young_limit] is set appropriately according to [action_pending]. This avoids situations where we have to repeatedly increase [young_limit] immediately despite the corresponding action not being processed. The intuition is that OCaml behaves as if asynchronous actions are "masked" inside C code (where an "unmasking" action is responsible for setting [young_limit] according to [action_pending] when the mask ends).
Setting action_pending is cheap now.
cf72573 to
783a8d4
Compare
|
(Part 2 out of N of this PR has been merged; part 3 out of N is currently proposed for review at #11307.) |
|
The last PR in this series has just been merged, so we can finally close this one. |
Following the detailed action plan at #10915, here is the last step which will make the plan look like it was instructions for "how to draw an owl".
Fixes #10915, ocaml-multicore/ocaml-multicore#791, ocaml-multicore/ocaml-multicore#806, some review comments in this series.
This PR is best reviewed commit-per-commit. Correctness concerns can be provided on a per-commit basis but for discussing design choices it is best to have the whole picture. I can split this PR in several parts, but the design makes sense as a whole.
Please refer to the individual commit logs for more description about what each one does.
I got impatient at the end so I stopped short of reimplementing StatMemprof.
(cc @sadiqj, @ctk21; I think this can also interest people with a very particular set of skills such as @jhjourdan and @stedolan.)