
Implement quality treatment for asynchronous actions in multicore #11057

Closed
gadmm wants to merge 6 commits into ocaml:trunk from gadmm:multicore_async_actions

Conversation


@gadmm gadmm commented Feb 24, 2022

Following the detailed action plan at #10915, here is the last step which will make the plan look like it was instructions for "how to draw an owl".

Fixes #10915, ocaml-multicore/ocaml-multicore#791, ocaml-multicore/ocaml-multicore#806, some review comments in this series.

This PR is best reviewed commit by commit. Correctness concerns can be raised on a per-commit basis, but for discussing design choices it is best to have the whole picture. I can split this PR into several parts, but the design makes sense as a whole.

Please refer to the individual commit logs for more description about what each one does.

I got impatient at the end so I stopped short of reimplementing StatMemprof.

(cc @sadiqj, @ctk21; I think this can also interest people with a very particular set of skills such as @jhjourdan and @stedolan.)


gadmm commented Feb 24, 2022

It would be nice to run this PR through the benchmark suite. I do not expect a performance degradation. However, if there is a bug in the way young_limit is reset, it could show up as a performance impact.

caml_interrupt_self();
// FIXME: Is it possible that Caml_state is not accessible from the
// current thread?
caml_set_action_pending();
Contributor Author


This is a correctness concern for the code currently in trunk (see 15th commit "do not access TLS in signal handler"). Let me know if it is justified.

If none, done; otherwise, try again */
// FIXME: does this become very slow if a signal is recorded but
// is masked for everybody in capacity of running signals at this
// point?
Contributor Author


This is a correctness concern for the code currently in trunk. We test whether any signal is pending, not just ours. It could be that a signal not for us (masked) is recorded and its processing by the target thread takes time. In this situation, running caml_process_pending_signals_exn is very expensive, because it is guarded by the same test (and in addition it contains system calls). Since running it does not end with all signal handlers having been called, it is called again in a loop. Let me know if my concerns are justified. See the 16th commit, "Do not wait for other threads to process their signals", for a proposed fix. (cc @xavierleroy)

@gadmm gadmm force-pushed the multicore_async_actions branch from 4cac4cc to 1613e61 Compare February 25, 2022 15:39
@xavierleroy

I just had a quick look at the commit messages, definitely not enough to have an opinion, but just enough to ask questions to help me understand. I'll focus on the "Do not access the TLS from signal handlers" commit.

Interrupt all domains when a signal arrives.

Isn't this very costly if we have many domains? Esp. if most of them / all of them but one block the signal anyway?

In mixed C/OCaml applications there is no guarantee that the POSIX
signal handler runs in an OCaml thread so Caml_state might not be
available.

That's true, and we should harden signal handling against this case (Caml_state will be NULL). Perhaps the broadcast of the signal to all OCaml domains could be done in this case only?

  • Do not make the hypothesis that the thread executing a POSIX signal
    handler is the most ready to execute the corresponding OCaml signal
    handler.

A recommended pattern to use signals with POSIX threads is to handle signals in one thread only, blocking signals in all other threads. I think it's a good way to use signals with OCaml 5 domains too. In which case the POSIX thread that gets the signal definitely is the most ready to execute the OCaml signal handler, because no other thread will.

  • POSIX/Apple's doc is unclear whether TLS is safe to read from signal
    handlers (old implementations do not), but this might be a
    theoretical concern only.

We need to check. macOS uses pthread_getspecific and the other platforms use __thread-qualified variables. I would expect both to be usable from signal handlers.


gadmm commented Feb 25, 2022

Interrupt all domains when a signal arrives.

Isn't this very costly if we have many domains? Esp. if most of them / all of them but one block the signal anyway?

With all the fixes from this PR, this causes at most one extra poll per received signal per domain. With usages like SIGALRM and SIGINT in mind, I consider it negligible.

In mixed C/OCaml applications there is no guarantee that the POSIX
signal handler runs in an OCaml thread so Caml_state might not be
available.

That's true, and we should harden signal handling against this case (Caml_state will be NULL). Perhaps the broadcast of the signal to all OCaml domains could be done in this case only?

Right, and thanks to this we can still keep the simplification of the thread initialization.

  • POSIX/Apple's doc is unclear whether TLS is safe to read from signal
    handlers (old implementations do not), but this might be a
    theoretical concern only.

We need to check. macOS uses pthread_getspecific and the other platforms use __thread-qualified variables. I would expect both to be usable from signal handlers.

It sounds like it is possible to access domain_self if made atomic (including macOS since pthread_getspecific is only for Caml_state). (C11 std §7.14.1.1)

Let me know if it is worth the added complexity.


sadiqj commented Mar 1, 2022

Thank you for taking a thorough go at this, @gadmm. I am reviewing it, but there are a lot of changes, so it may take a while.

Just a meta comment: I personally feel that some of these commit sets could have been broken up into separate PRs (e.g. the memory-ordering ones and the fix for TLS access from signal handlers), which might have encouraged more reviewers and kept the discussions on those PRs more focused.


gadmm commented Mar 1, 2022

A couple of straightforward commits could be done separately, but other commits depend on them; given that they are straightforward, splitting them off would only delay the review of the ones that need scrutiny. It is also possible to consider prefixes of the commit set, but unfortunately there are still a lot of dependencies, and the design makes sense as a whole. Yes, it is a lot of commits, but I at least managed to split the work into small, meaningful commits, so I think it is not so bad given the scale of the change. (I can split the PR into prefixes as you go through your review if you want.)


@sadiqj sadiqj left a comment


Right.

Apart from small comments, I think I'm happy with ba08348, 952a19d, 6077892 and ae36b9c. These would probably form a pretty uncontroversial PR.

fe14eb8 and cfbbf6f look fine.

Personally I'm not keen on d597f76. I often misread atomic_store_rel as atomic_store. I prefer keeping the order explicit, though other people might have different views.

The signals TLS changes 289ee88 and c65e08a need to be moved to a separate PR: there are quite a few considerations here, and I worry about deadlocks when bringing up new domains. If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

5fbd234 also needs some discussion, given it affects assembly on all platforms. Presumably only allocations until caml_process_pending_actions_exn is called go through the slow path?

caml_domain_state * dom_st = Caml_state;
/* An interrupt might have been queued in the meanwhile; this
achieves the proper synchronisation. */
atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
Contributor


If we're ignoring the return value here, why exchange rather than store?

Contributor Author


It is something I'd like some confirmation from a C memory model expert: even if we ignore the return value, this is stronger than an atomic store and makes sure that the subsequent reads of dom_st are properly synchronized (having in mind a parallel thread that races to interrupt the domain).

Contributor Author


An alternative might be a relaxed atomic store followed by an acquire fence before the subsequent reads, but I'd like a confirmation on this as well.

Contributor Author


Note: I am pretty sure about the exchange version (but it might be overkill).


gadmm commented Mar 7, 2022

I have opened #11095 in case you want to approve the first commits separately and focus here on the remaining ones. (edit: I initially considered the first 8 commits there, but I had to cut it to the first 4 due to failing tests.)


gadmm commented Mar 7, 2022

Personally I'm not keen on d597f76. I often misread atomic_store_rel as atomic_store. I prefer keeping the order explicit, though other people might have different views.

Well, atomic_store_rel is already there, so this does not change anything in that respect, right? I do not have any trouble with it. I do not think atomic_{store,load}_relaxed makes anything more ambiguous, and it removes quite a bit of line-wrapping due to overlong lines; to me that counts as a readability gain, which is why I am proposing it.

The signals TLS changes 289ee88 and c65e08a need to be moved to a separate PR - there's quite a few considerations here, I worry about deadlocks when bringing up new domains.

Ok, let us focus on commits up to cfbbf6f first.

If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

I would need to have a look, can you point me at the code you have in mind?

5fbd234 also needs some discussion, given it affects assembly on all platforms. Presumably only allocations until caml_process_pending_actions_exn is called go through the slow path?

Right, this one can deserve a separate PR. Its presence there clarifies the intended meaning of action_pending.


gadmm commented Mar 7, 2022

In the end the PR containing only the first 4 commits has failed as well, so #11095 now contains everything up to cfbbf6f, hoping that this will get a green light from CI. It was premature to open a separate PR though, since this is forced to include some of the commits that are still under discussion here. Let us keep the discussion inside the present PR.


sadiqj commented Mar 8, 2022

If we can't rely on checking Caml_state for NULL then I'm also not convinced we couldn't do something simpler like round-robin interrupt domains (via all_domains).

I would need to have a look, can you point me at the code you have in mind?

https://github.com/ocaml/ocaml/blob/trunk/otherlibs%2Fsysthreads%2Fst_stubs.c#L596 is where we do it; there is an assumption there that Caml_state is either NULL or a valid pointer to domain state. That said, having a look at the assembly for caml_c_thread_register, there is a call to __tls_get_addr which I think can allocate on first access.


gadmm commented Mar 8, 2022

There is nothing wrong with that, the problem is doing it from a POSIX signal handler. Am I missing something?


sadiqj commented Mar 8, 2022

There is nothing wrong with that, the problem is doing it from a POSIX signal handler. Am I missing something?

No, you're right that it's an issue in a signal handler because it might allocate.


gadmm commented Mar 10, 2022

I was intrigued by your mention of __tls_get_addr because I thought C11 thread-local variables were async-signal-safe for lock-free atomics (C11 §7.14.1.1). I was mostly worried about the macOS version, which uses POSIX TLS (where it is less clear how to make the variable atomic), and concluded it was good enough to use domain_self from runtime/domain.c instead of Caml_state, as long as we made it atomic. But now, looking further, there is indeed an allocation problem. See https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#async-signal-safe-tls for excruciating details.

The problem seems to stem exclusively from loading the thread-local variable from a dynamically loaded library (dlopen). In that case the TLS is dynamically allocated and glibc does a lazy initialization on first access. This is a bit beyond my understanding of how runtime code can be used. Is the dlopen situation applicable, at least for domain_self?

@gadmm gadmm force-pushed the multicore_async_actions branch 2 times, most recently from 8ea60a2 to b5db561 Compare March 12, 2022 16:16

gadmm commented Mar 20, 2022

@kayceesrk CI complains because of failing test weak-ephe-final/finaliser2.ml. Diff shows:

 test1
 test2: 2
 test2: 1
-test3: joined
 test3: parent.1
+test3: joined
 test3: child.1
 test3: child.2

I suspect that the new result is correct and previously it was buggy (assuming one can make assumptions on when finalisers run). Can you please have a look and confirm that the solution is indeed to update the reference file?

If so, I think that this PR was meant to fix this test right from the start, but we see it only now due to the off-by-one error revealed in #11095.


gadmm commented Apr 27, 2022

(Part 1 out of N of this PR has been merged; part 2 out of N is currently proposed for review at #11190.)

gadmm added 2 commits June 9, 2022 02:11
Interrupt all domains when a signal arrives. The point of view of this
commit is that signals (SIGINT, SIGALRM...) arrive infrequently enough
that this is affordable.

* In mixed C/OCaml applications there is no guarantee that the POSIX
  signal handler runs in an OCaml thread so Caml_state might not be
  available.

* OSX implementation of Caml_state: POSIX/Apple's doc is unclear
  whether explicit TLS is safe to read from signal handlers (old
  implementations do not), but this might be a theoretical concern
  only. (In particular it is unclear how to give it atomic type.)

* While C11 mandates that atomic thread-local variables are
  async-signal-safe for reading, gcc does not conform and can allocate
  in corner cases involving dynamic linking.

* Do not make the hypothesis that the thread executing a POSIX signal
  handler is the most ready to execute the corresponding OCaml signal
  handler. (This is likely unnecessary for idiomatic usage of signals,
  but acting differently depending on whether we can access the
  current domain is more work given the situation around
  async-signal-safety of TLS, e.g. one might be able to access
  domain_self but then the change to make it atomic is invasive.)

* These changes allow simplifications to the spawning of threads and
  domains since we no longer need to mask signals during thread
  creation. (See subsequent commit.)

* The biggest risk is to cause repeated polling in C code that would
  repeatedly poll without clearing pending actions. A subsequent
  commit is proposed to fix this behaviour and make sure that a signal
  notified to a domain causes the minimal amount of polling.
As follows from the preceding commit.
gadmm added 4 commits June 9, 2022 13:52
- Do not slow down allocations when an asynchronous callback cannot be
  handled immediately.

- Do not wait for other threads to have processed signals not for us
  in [caml_enter_blocking_section]. (Replaces the implementation from
  the previous commit.)

Before this commit, if a pending action cannot be processed
immediately then all the allocations go through the slow path until it
is processed. It is also possible that we are looping at the start of
blocking sections because a signal is pending but not for us.

This commit introduces a simpler design whereby [action_pending] alone
is used to remember that an asynchronous callback should be run, and
[young_limit] is only used to interrupt running code. The key is to
modify the transition from C to OCaml code so that [young_limit] is set
appropriately according to [action_pending]. This avoids situations
where we have to repeatedly increase [young_limit] immediately despite
the corresponding action not being processed.

The intuition is that OCaml behaves as if asynchronous actions are
"masked" inside C code (where an "unmasking" action is responsible for
setting [young_limit] according to [action_pending] when the mask
ends).

gadmm commented Jun 9, 2022

(Part 2 out of N of this PR has been merged; part 3 out of N is currently proposed for review at #11307.)


gadmm commented Sep 14, 2023

The last PR in this series has just been merged, so we can finally close this one.

@gadmm gadmm closed this Sep 14, 2023