
Implement quality treatment for asynchronous actions in multicore #11057

Closed
gadmm wants to merge 6 commits into ocaml:trunk from gadmm:multicore_async_actions

Conversation


@gadmm gadmm commented Feb 24, 2022

Following the detailed action plan at #10915, here is the last step which will make the plan look like it was instructions for "how to draw an owl".

Fixes #10915, ocaml-multicore/ocaml-multicore#791, ocaml-multicore/ocaml-multicore#806, some review comments in this series.

This PR is best reviewed commit by commit. Correctness concerns can be raised on a per-commit basis, but for discussing design choices it is best to have the whole picture. I can split this PR into several parts, but the design makes sense as a whole.

Please refer to the individual commit logs for more description about what each one does.

I got impatient at the end so I stopped short of reimplementing StatMemprof.

(cc @sadiqj, @ctk21; I think this can also interest people with a very particular set of skills such as @jhjourdan and @stedolan.)


gadmm commented Feb 24, 2022

It would be nice to run this PR through the benchmark suite. I do not expect a performance degradation. However, if there is a bug in the way young_limit is reset, it could show up as a performance impact.

caml_interrupt_self();
// FIXME: Is it possible that Caml_state is not accessible from the
// current thread?
caml_set_action_pending();
Contributor Author


This is a correctness concern for the code currently in trunk (see 15th commit "do not access TLS in signal handler"). Let me know if it is justified.

If none, done; otherwise, try again */
// FIXME: does this become very slow if a signal is recorded but
// is masked for everybody in capacity of running signals at this
// point?
Contributor Author


This is a correctness concern for the code currently in trunk. We test whether any signal is pending, not just ours. It could be that a signal not for us (masked) is recorded and its processing by the target thread takes time. In this situation, running caml_process_pending_signals_exn is very expensive, because it is guarded by the same test (and in addition it contains system calls). Since running it does not end with all signal handlers having been called, it is called again in a loop. Let me know if my concerns are justified. See the 16th commit, "Do not wait for other threads to process their signals", for a proposed fix. (cc @xavierleroy)

@gadmm gadmm force-pushed the multicore_async_actions branch from 4cac4cc to 1613e61 Compare February 25, 2022 15:39
@xavierleroy

I just had a quick look at the commit messages, definitely not enough to have an opinion, but just enough to ask questions to help me understand. I'll focus on the "Do not access the TLS from signal handlers" commit.

Interrupt all domains when a signal arrives.

Isn't this very costly if we have many domains? Esp. if most of them / all of them but one block the signal anyway?

In mixed C/OCaml applications there is no guarantee that the POSIX
signal handler runs in an OCaml thread so Caml_state might not be
available.

That's true, and we should harden signal handling against this case (Caml_state will be NULL). Perhaps the broadcast of the signal to all OCaml domains could be done in this case only?

  • Do not make the hypothesis that the thread executing a POSIX signal
    handler is the most ready to execute the corresponding OCaml signal
    handler.

A recommended pattern to use signals with POSIX threads is to handle signals in one thread only, blocking signals in all other threads. I think it's a good way to use signals with OCaml 5 domains too. In which case the POSIX thread that gets the signal definitely is the most ready to execute the OCaml signal handler, because no other thread will.

  • POSIX/Apple's doc is unclear whether TLS is safe to read from signal
    handlers (old implementations do not), but this might be a
    theoretical concern only.

We need to check. macOS uses pthread_getspecific and the other platforms use __thread-qualified variables. I would expect both to be usable from signal handlers.


gadmm commented Feb 25, 2022

Interrupt all domains when a signal arrives.

Isn't this very costly if we have many domains? Esp. if most of them / all of them but one block the signal anyway?

With all the fixes from this PR, this causes at most one extra poll per received signal per domain. With usages like SIGALRM and SIGINT in mind, I consider it negligible.

In mixed C/OCaml applications there is no guarantee that the POSIX
signal handler runs in an OCaml thread so Caml_state might not be
available.

That's true, and we should harden signal handling against this case (Caml_state will be NULL). Perhaps the broadcast of the signal to all OCaml domains could be done in this case only?

Right, and thanks to this we can still keep the simplification of the thread initialization.

  • POSIX/Apple's doc is unclear whether TLS is safe to read from signal
    handlers (old implementations do not), but this might be a
    theoretical concern only.

We need to check. macOS uses pthread_getspecific and the other platforms use __thread-qualified variables. I would expect both to be usable from signal handlers.

It sounds like it is possible to access domain_self if made atomic (including macOS since pthread_getspecific is only for Caml_state). (C11 std §7.14.1.1)

Let me know if it is worth the added complexity.


sadiqj commented Mar 1, 2022

Thank you for taking a thorough go at this, @gadmm. I am reviewing it, but there are a lot of changes, so it may take a while.

Just a meta comment: I personally feel that some of these commit sets could have been broken up into separate PRs (e.g. the memory-ordering ones and the fix for TLS access from signal handlers), which might have encouraged more reviewers and kept the discussions on those PRs more focused.


gadmm commented Mar 1, 2022

A couple of straightforward commits could be done separately, but other commits depend on them; given that they are straightforward, splitting them off would only delay the review of the ones that need scrutiny. It is also possible to consider prefixes of the commit set, but unfortunately there are still a lot of dependencies, and the design makes sense as a whole. Yes, it is a lot of commits, but I at least managed to split the work into small, meaningful commits, so I think it is not so bad given the scale of the change. (I can split the PR into prefixes as you go through your review if you want.)


@sadiqj sadiqj left a comment


Right.

Apart from small comments, I think I'm happy with ba08348, 952a19d, 6077892 and ae36b9c. These would probably form a pretty uncontroversial PR.

fe14eb8 and cfbbf6f look fine.

Personally I'm not keen on d597f76. I often misread atomic_store_rel as atomic_store. I prefer keeping the order explicit, though other people might have different views.

The signals TLS changes 289ee88 and c65e08a need to be moved to a separate PR: there are quite a few considerations here, and I worry about deadlocks when bringing up new domains. If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

5fbd234 also needs some discussion, given it affects assembly on all platforms. Presumably only allocations until caml_process_pending_actions_exn is called go through the slow path?

caml_domain_state * dom_st = Caml_state;
/* An interrupt might have been queued in the meanwhile; this
achieves the proper synchronisation. */
atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
Contributor


If we're ignoring the return value here, why exchange rather than store?

Contributor Author


It is something I'd like some confirmation from a C memory model expert: even if we ignore the return value, this is stronger than an atomic store and makes sure that the subsequent reads of dom_st are properly synchronized (having in mind a parallel thread that races to interrupt the domain).

Contributor Author


An alternative might be a relaxed atomic store followed by an acquire fence before the subsequent reads, but I'd like a confirmation on this as well.

Contributor Author


Note: I am pretty sure about the exchange version (but it might be overkill).


gadmm commented Mar 7, 2022

I have opened #11095 in case you want to approve the first commits separately and focus here on the remaining ones. (edit: I initially considered the first 8 commits there, but I had to cut it to the first 4 due to failing tests.)


gadmm commented Mar 7, 2022

Personally I'm not keen on d597f76. I often misread atomic_store_rel as atomic_store. I prefer keeping the order explicit, though other people might have different views.

Well, atomic_store_rel is already there, so this does not change anything in that respect, right? I do not have any trouble with it. I do not think atomic_{store,load}_relaxed makes anything more ambiguous, and it removes quite a bit of line-wrapping due to overlong lines; to me that counts as a readability gain, which is why I am proposing it.

The signals TLS changes 289ee88 and c65e08a need to be moved to a separate PR - there's quite a few considerations here, I worry about deadlocks when bringing up new domains.

Ok, let us focus on commits up to cfbbf6f first.

If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

I would need to have a look, can you point me at the code you have in mind?

5fbd234 also needs some discussion, given it affects assembly on all platforms. Presumably only allocations until caml_process_pending_actions_exn is called go through the slow path?

Right, this one can deserve a separate PR. Its presence there clarifies the intended meaning of action_pending.


gadmm commented Mar 7, 2022

In the end the PR containing only the first 4 commits has failed as well, so #11095 now contains everything up to cfbbf6f, hoping that this will get a green light from CI. It was premature to open a separate PR though, since this is forced to include some of the commits that are still under discussion here. Let us keep the discussion inside the present PR.


sadiqj commented Mar 8, 2022

If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

I would need to have a look, can you point me at the code you have in mind?

https://github.com/ocaml/ocaml/blob/trunk/otherlibs%2Fsysthreads%2Fst_stubs.c#L596 is where we do it; there is an assumption there that Caml_state is either NULL or a valid pointer to domain state. That said, having a look at the assembly for caml_c_thread_register, there is a call to __tls_get_addr which I think can allocate on first access.


gadmm commented Mar 8, 2022

There is nothing wrong with that, the problem is doing it from a POSIX signal handler. Am I missing something?


sadiqj commented Mar 8, 2022

There is nothing wrong with that, the problem is doing it from a POSIX signal handler. Am I missing something?

No, you're right that it's an issue in a signal handler because it might allocate.


gadmm commented Mar 10, 2022

I was intrigued by your mention of __tls_get_addr because I thought C11 thread-local variables were async-signal-safe for lock-free atomics (C11 §7.14.1.1). I was mostly worried about the macOS version, which uses POSIX TLS (where it is less clear how to make the variable atomic), and concluded it was good enough to use domain_self from runtime/domain.c instead of Caml_state, as long as we made it atomic. But now, looking further, there is indeed an allocation problem. See https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#async-signal-safe-tls for excruciating details.

The problem seems to stem exclusively from loading the thread-local variable from a dynamically loaded library (dlopen). In that case the TLS is dynamically allocated and glibc does a lazy initialization on first access. This is a bit beyond my understanding of how runtime code can be used. Is the dlopen situation applicable, at least for domain_self?

@gadmm gadmm force-pushed the multicore_async_actions branch 2 times, most recently from 8ea60a2 to b5db561 Compare March 12, 2022 16:16

gadmm commented Mar 20, 2022

@kayceesrk CI complains because of failing test weak-ephe-final/finaliser2.ml. Diff shows:

 test1
 test2: 2
 test2: 1
-test3: joined
 test3: parent.1
+test3: joined
 test3: child.1
 test3: child.2

I suspect that the new result is correct and previously it was buggy (assuming one can make assumptions on when finalisers run). Can you please have a look and confirm that the solution is indeed to update the reference file?

If so, I think that this PR was meant to fix this test right from the start, but we see it only now due to the off-by-one error revealed in #11095.


gadmm commented Apr 27, 2022

(Part 1 out of N of this PR has been merged; part 2 out of N is currently proposed for review at #11190.)

gadmm added 2 commits June 9, 2022 02:11
Interrupt all domains when a signal arrives. The point of view of this
commit is that signals (SIGINT, SIGALRM...) arrive infrequently enough
that this is affordable.

* In mixed C/OCaml applications there is no guarantee that the POSIX
  signal handler runs in an OCaml thread so Caml_state might not be
  available.

* OSX implementation of Caml_state: POSIX/Apple's doc is unclear
  whether explicit TLS is safe to read from signal handlers (old
  implementations do not), but this might be a theoretical concern
  only. (In particular it is unclear how to give it atomic type.)

* While C11 mandates that atomic thread-local variables are
  async-signal-safe for reading, gcc does not conform and can allocate
  in corner cases involving dynamic linking.

* Do not make the hypothesis that the thread executing a POSIX signal
  handler is the most ready to execute the corresponding OCaml signal
  handler. (This is likely unnecessary for idiomatic usage of signals,
  but acting differently depending on whether we can access the
  current domain is more work given the situation around
  async-signal-safety of TLS, e.g. one might be able to access
  domain_self but then the change to make it atomic is invasive.)

* These changes allow simplifications to the spawning of threads and
  domains since we no longer need to mask signals during thread
  creation. (See subsequent commit.)

* The biggest risk is to cause repeated polling in C code that would
  repeatedly poll without clearing pending actions. A subsequent
  commit is proposed to fix this behaviour and make sure that a signal
  notified to a domain causes the minimal amount of polling.
As follows from the preceding commit.
gadmm added 4 commits June 9, 2022 13:52
- Do not slow down allocations when an asynchronous callback cannot be
  handled immediately.

- Do not wait for other threads to have processed signals not for us
  in [caml_enter_blocking_section]. (Replaces the implementation from
  the previous commit.)

Before this commit, if a pending action cannot be processed
immediately then all the allocations go through the slow path until it
is processed. It is also possible that we are looping at the start of
blocking sections because a signal is pending but not for us.

This commit introduces a simpler design whereby [action_pending] alone
is used to remember that an asynchronous callback should be run, and
[young_limit] is only used to interrupt running code. The key is to
modify the transition from C to OCaml code so that [young_limit] is set
appropriately according to [action_pending]. This avoids situations
where we have to repeatedly increase [young_limit] immediately despite
the corresponding action not being processed.

The intuition is that OCaml behaves as if asynchronous actions are
"masked" inside C code (where an "unmasking" action is responsible for
setting [young_limit] according to [action_pending] when the mask
ends).

gadmm commented Jun 9, 2022

(Part 2 out of N of this PR has been merged; part 3 out of N is currently proposed for review at #11307.)


gadmm commented Sep 14, 2023

The last PR in this series has just been merged, so we can finally close this one.

@gadmm gadmm closed this Sep 14, 2023