
Simplifications and fixes to multicore systhreads implementation #11271

Merged
gasche merged 4 commits into ocaml:trunk from gadmm:systhread_simpl_and_fixes
Jul 1, 2022

Conversation

@gadmm
Contributor

@gadmm gadmm commented May 22, 2022

Here are various fixes and simplifications to the multicore systhreads implementation (cc @Engil).

From the commit logs:


Do not run caml_thread_yield from the backup thread.

This behaviour was likely unintended given that caml_thread_yield runs
signal handlers, which can execute arbitrary OCaml code.


Simplifications to multicore systhreads

  1. Use a thread-local variable this_thread instead of explicit TLS;
    prefer using this_thread over Current_thread.

  2. Remove redundant All_threads circular list and remove a FIXME.

  3. Never access Caml_state while the domain lock is not held.

The main motivation is 3. (in preparation for #11272). But the
fact that this is done by simplifying the code is nice.


Remove some dead code

Here, we are never the last thread of the domain.


Fix some incorrect comments and restore some comments from the pre-multicore systhreads implementation.


This is best reviewed commit by commit. No change entry needed.

@gadmm gadmm changed the title Systhread simpl and fixes Simplifications and fixes to multicore systhreads implementation May 22, 2022
@dra27
Member

dra27 commented May 23, 2022

IIUC the first commit fixes a bug, the remaining ones are (welcome!) re-workings in systhreads. If that's right, please could the first commit be separated as I think that's then definitely a 5.0 change, where the rest might have to be 5.1?

@gadmm
Contributor Author

gadmm commented May 23, 2022

On whether this is necessary for 5.0: the "bug" part fixed by the first commit is also fixed by the parts of the PR series reworking async callbacks for multicore that are already necessary for 5.0 (by not calling into OCaml inside caml_thread_yield). So the first commit is not necessary for 5.0 by itself (unless calling caml_thread_yield in the backup thread causes more issues than I know), and neither is this PR. On the other hand, if people like the approach in #11272, which depends on this PR, then that should be in for 5.0 (API change), in which case this PR is also necessary for 5.0.

@gasche
Member

gasche commented May 23, 2022

Personally I think of such "documentation / cleanups / refactoring" PRs (I wrote a couple myself) as a continuation of the review effort for the multicore merge, which has happened in an unusual way with several "big picture" discussions but much less "looking at the details" review work than typical PRs. So: in scope for "whenever before or after the merge", including now -- and probably a better long-term use of our collective time than giving up on documentation and rushing to finish new feature X or Y in time for 5.0.

But of course it's a matter of finding reviewers able to do a review. Here I wonder if @Engil would be available? (Or maybe @ctk21?) We need someone (else than @gadmm) who knows how threads work in Multicore. Personally I don't; is there some clear documentation in the implementation already?

@gadmm
Contributor Author

gadmm commented May 23, 2022

Personally I don't; is there some clear documentation in the implementation already?

I do not claim to be an expert; but it is fairly similar to the pre-multicore systhreads implementation, with some subtlety in the interaction with the backup thread.

void caml_thread_interrupt_hook(void)
{
/* Do not attempt to yield from the backup thread */
if (caml_bt_is_in_blocking_section()) return;
Contributor Author

@gadmm gadmm May 23, 2022


The CI failure was hard to reproduce, but when I managed to do it, the backtrace pointed fingers at this test not being enough to prevent caml_thread_yield from running from the backup thread. Do you have ideas (@Engil? @ctk21?) for a better way of excluding the backup thread here?
(To reproduce the bug reliably, it turns out that you either need an Intel CPU, or to put load on your AMD CPU and be patient. I'd be curious to see an explanation of the differences between processors.)

Contributor Author


@kayceesrk The CI failure at multicore_systhreads.ml appears to come from here, let me know if you have an idea.

Contributor


Very nice catch!

Indeed, this is a problematic situation, and after investigating for a bit I figured out why this solution does not work.
The reason is that there can be a timing window where the mutator thread requests re-entry into the runtime system while the backup thread is processing interrupts.

backup_thread_message is then set to BT_ENTERING_OCAML, and the backup thread will yield control once it is done processing the interrupts.

However, in this scenario, caml_bt_is_in_blocking_section is not a reliable way to check the invariant that we are not running from the backup thread.

Likely we need a separate way to check this invariant (which we do not have right now). Maybe a thread-local value to check?

Contributor


Addendum: caml_bt_is_in_blocking_section was not meant as a way to check whether we are the backup thread or not.
Its purpose is rather to determine the proper course of action around the masterlock (i.e., is there a thread waiting to run? should we signal the backup thread if not?).

Contributor Author


Thanks for investigating. Naive question (I did not look at the code), is it not possible to test backup_thread_message for the two values? Otherwise a new thread-local variable is fine too.

Contributor


@gadmm I don't think this would work: the backup thread is either processing interrupts (IN_BLOCKING_SECTION), about to yield, or already waiting on the domain lock (ENTERING_OCAML). (And then there are also BT_TERMINATING and BT_INIT for lifecycle purposes.)
This would probably always return true for any thread calling it.

I think we do need to expose a way to tell if a thread is a backup thread, though there may be a nicer solution that eludes me.

@xavierleroy
Contributor

I'd be more inclined to review if the motivations were given for some of the changes. For example "Use a thread-local variable this_thread instead of explicit TLS; prefer using this_thread over Current_thread." doesn't tell me why a thread-local variable is preferable: is that for performance? safety (w.r.t. signals, for example)? portability?

@gadmm
Contributor Author

gadmm commented May 23, 2022

this_thread is more efficient than the previous Current_thread, which goes through a table indexed by Caml_state->id. (But I won't claim the performance difference is measurable.) I guess the convoluted approach was due to the fact that accessing the explicit TLS was syntactically heavy. I imagine that using explicit TLS was inspired by the previous implementation, and perhaps by previous problems with thread_local on macOS, but here using a static thread_local is just fine for multicore. This also removes some lines of code, and removes a couple of corner cases in caml_register/unregister_c_thread. It adds one inside caml_thread_scan_root, but I'd argue that this clarifies the code (it makes explicit that we can be in the backup thread). This change is also motivated by not accessing Caml_state without the domain lock. The change is pretty superficial otherwise.

The change from Caml_state->id-based indexing to this_thread->domain_id-based indexing forces us to move some code around within the same function (when this_thread is not yet initialized or has been reset). The motivation here is again to not access Caml_state without holding the domain lock.

Merging All_threads with Current_thread also makes the code more natural in two places (where lines of code are removed).

In this file, the accesses to Caml_state->id without holding the domain lock were safe, so part of the motivation will stem from how people feel about #11272.

@gadmm
Contributor Author

gadmm commented May 23, 2022

(Thanks for asking)

@kayceesrk kayceesrk assigned kayceesrk and unassigned kayceesrk May 24, 2022
@gadmm
Contributor Author

gadmm commented May 24, 2022

Thanks for the review @kayceesrk! I think you have inadvertently written your comments in the wrong thread (#11272). I am a bit busy for the coming days but I will address your comments in the present PR.

@abbysmal abbysmal self-assigned this Jun 9, 2022
@abbysmal
Contributor

abbysmal commented Jun 9, 2022

Sorry for the long delay answering this, @gadmm, I was off for a while.
I am back now and will review this in the coming days.

gadmm added 4 commits June 20, 2022 15:24
This behaviour was likely unintended given that caml_thread_yield runs
signal handlers, which can execute arbitrary OCaml code.
1. Use a thread-local variable this_thread instead of explicit pthread
   TLS. This lets us avoid contortions.

2. Prefer using this_thread over Active_thread (trivial).

3. Never access Caml_state while the domain lock is not held.

The main motivation is 3. (in preparation for a subsequent PR). But the
fact that this is done by simplifying the code is nice too.
Contributor

@abbysmal abbysmal left a comment


I just did a first pass of review.

I feel like the bug fix is nice, and I understand the overall rationale behind some simplifications.
However at times I feel like the code is less explicit than it used to be.

For instance, the this_thread and Current_thread situation is a bit hard to understand; I can't see a consistent rule for when to use one over the other (?).

Also, I am not super confident about some of these simplifications, and will need a deeper look at the corner cases removed by this PR.

I commented in a few places I was not sure I understood well.

Current_thread = new_thread;
Tick_thread_running = 0;

st_masterlock_init(&Thread_main_lock);
Contributor


Why was the initialization of the lock moved toward the end?
Is it a style decision?

Contributor Author


It is necessary because you need this_thread to be initialized to use the macro. I have made it better in the new version.


/* If no other OCaml thread remains, ask the tick thread to stop
so that it does not prevent the whole process from exiting (#9971) */
if (All_threads == NULL) caml_thread_cleanup(Val_unit);
Contributor


Was this really dead code?

Contributor Author


IIUC, this follows from the comment /* The main domain thread does not go through [caml_thread_stop]. There is always one more thread in the chain at this point in time. */ and the assert above. Let me know if I missed something.
But this is surprising indeed, because this means that the tick thread is never stopped. I have removed this commit and I will reintroduce it in a PR that also fixes stopping the tick thread during domain stop.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes branch from bd753d4 to cd4aa47 Compare June 21, 2022 13:23
@gadmm
Contributor Author

gadmm commented Jun 21, 2022

I have found a few more fixes to do so I have split the PR in two (i.e. I removed some commits and will reintroduce them in another PR).

I have renamed Current_thread as proposed by @kayceesrk. I have also checked that no other function is unexpectedly callable from the backup thread.

I have found a nice solution to caml_thread_yield, please have a look.

Please accept a small tooling-documentation commit; it is unrelated, except that while debugging the caml_thread_yield issue I had to hunt down one time too many the information I have now written in the commit.

@gadmm
Contributor Author

gadmm commented Jun 21, 2022

For instance, the this_thread and Current_thread situation is a bit hard to understand, I can't see a consistency in when to use one over the other (?).

[Current_thread -> Active_thread]
I tried to use this_thread whenever possible (because it does not go through an indirection), and Active_thread only when we really mean it, never as a synonym for this_thread when the two happen to match. This should make the distinction between the two clearer. I have a further commit for a later PR where I clarify the invariant concerning Active_thread, notably by using the same set_active function everywhere (this forces us to fix another bug, so I leave it for later).

Also some of these simplifications I am not super confident about and will need a deeper look into the corner cases removed by this PR.

One commit was split in two, which should make it easier. I have removed other commits to deal with later.

@gadmm
Contributor Author

gadmm commented Jun 30, 2022

This PR had a first round of review by @Engil and @kayceesrk, and the comments have been addressed (I think). The PR has been simplified in the hope that we could converge quickly and move to the second PR I mentioned (which I think fixes shutting down the tick threads) as well as #11272, which depends on this PR. Is that ok with you?

/* If no other OCaml thread remains, ask the tick thread to stop
so that it does not prevent the whole process from exiting (#9971) */
if (All_threads == NULL) caml_thread_cleanup(Val_unit);
if (Active_thread == NULL) caml_thread_cleanup(Val_unit);
Contributor


Given the earlier comment about the main thread not going through caml_thread_stop and the assertion that there is at least one other thread, is it possible at all that Active_thread becomes NULL here?

Contributor

@kayceesrk kayceesrk left a comment


The PR fixes a bug and includes a couple of improvements (avoiding the use of explicit pthread TLS in favour of a thread-local variable). While I would have liked each of these to be a separate PR, each improvement is small enough, and split into its own commit, that they can go in as they are.

I just have one open question #11271 (review).

@kayceesrk kayceesrk added this to the 5.0 milestone Jul 1, 2022
@gadmm
Contributor Author

gadmm commented Jul 1, 2022

Thanks, let's discuss your question in the next PR.

CI is green, no change entry is needed.

@gasche gasche merged commit 475dc68 into ocaml:trunk Jul 1, 2022
@gadmm
Contributor Author

gadmm commented Jul 1, 2022

Thanks, could you please cherry-pick on 5.0?

@gasche
Member

gasche commented Jul 2, 2022

I would normally think twice about merging a subtle PR to the release branch without going through our glorious release manager @Octachron, but this is 5.0-the-alpha-release and reviewers agreed that this fixes a bug, so let's go for it.

gasche added a commit that referenced this pull request Jul 2, 2022
Simplifications and fixes to multicore systhreads implementation

(cherry picked from commit 475dc68)
@xavierleroy
Contributor

That may have been unwise. Jenkins CI is rather unhappy with these commits (https://ci.inria.fr/ocaml/job/main/3090/):

Alpine Linux x86_64
    tests/parallel/'fib_threads.ml' with 1.1 (bytecode) Process 17083 got signal 11(Segmentation fault), no core dumped
    tests/lib-systhreads/'multicore_lifecycle.ml' with 1.1 (bytecode) Process 13556 got signal 11(Segmentation fault), no core dumped

PPC64

    tests/lib-systhreads/'multicore_lifecycle.ml' with 1.1 (bytecode) Timeout expired, killing all child processes
    tests/parallel/'multicore_systhreads.ml' with 1.1 (bytecode) Timeout expired, killing all child processes
    tests/parallel/'fib_threads.ml' with 1.1 (bytecode) Timeout expired, killing all child processes

PPC32
    tests/lib-systhreads/'multicore_lifecycle.ml' with 1.1 (bytecode) Timeout expired, killing all child processes
    tests/parallel/'multicore_systhreads.ml' with 1.1 (bytecode) Timeout expired, killing all child processes
    tests/parallel/'fib_threads.ml' with 1.1 (bytecode) Timeout expired, killing all child processes

There's also a boatload of Windows failures, but possibly unrelated.

Time to debug...

@gasche
Member

gasche commented Jul 2, 2022

Yep, both 5.0 and trunk appear to fail CI. (The testsuite passes with 5.0 on my machine; I checked before pushing.)

I will revert the cherry-pick commit from 5.0 for now to un-break that CI. I don't know if we should revert from trunk as well.

gasche added a commit that referenced this pull request Jul 2, 2022
@gasche
Member

gasche commented Jul 2, 2022

@gadmm this PR appears to break bytecode tests (with a segfault) under some architectures including Alpine-x86_64 (that one is easy to test from Docker I suppose).

@dra27
Member

dra27 commented Jul 2, 2022

I just pushed 373c941 to trunk which is a missing part of #11304, but that doesn't eliminate the Windows problems caused by this PR on Jenkins either.

I'd suggest we revert this on trunk as well (I'm also taking another look at restoring more of the Windows CI testing, as that would have been a useful canary here)

@kayceesrk
Contributor

kayceesrk commented Jul 2, 2022

I agree with reverting the change on 5.0 and trunk. I’ll study the failure early next week if @gadmm does not get to it earlier.

@xavierleroy
Contributor

For what it's worth, here is what GDB says about the Alpine segfault in fib_thread:

Thread 14 "ocamlrun" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 30081]
0x00007fffe7e5e1db in caml_thread_restore_runtime_state () at st_stubs.c:183
183	  Caml_state->current_stack = Active_thread->current_stack;
(gdb) bt
#0  0x00007fffe7e5e1db in caml_thread_restore_runtime_state ()
    at st_stubs.c:183
#1  0x00007fffe7e5e6db in caml_thread_stop () at st_stubs.c:450
#2  caml_thread_start (v=<optimized out>) at st_stubs.c:483
#3  0x00007ffff7fba221 in ?? () from /lib/ld-musl-x86_64.so.1
#4  0x0000000000000000 in ?? ()

Same context for multicore_lifecycle.

gasche added a commit that referenced this pull request Jul 2, 2022
This reverts commit 475dc68, reversing
changes made to 6f9a0a0.

The PR appears to break the CI, see
  https://ci.inria.fr/ocaml/job/main/3090/
@gasche
Member

gasche commented Jul 2, 2022

Thanks for the feedback, the PR is now reverted from both 5.0 and trunk.

Once the bugs are found, we need to open another PR (GitHub won't let us "reopen" a reverted PR), and run precheck on it to test the INRIA CI before merging.

(I wonder why the GitHub CI did not observe the issue. Once we understand what was at fault, maybe we will be able to strengthen the CI.)

@gadmm
Contributor Author

gadmm commented Jul 2, 2022

I am looking into it. I could bisect the failure on Alpine to the third commit. I opened #11390 to run the INRIA CI on the other commits and check whether the other failures come from the same commit. Could someone please run precheck on it?

@gadmm
Contributor Author

gadmm commented Jul 3, 2022

Thanks everyone for your assistance. I have opened #11393 to fix the bug (see more details there). So now we have 3 PRs for 5.0 (in order: #11390 -> #11385 -> #11393) and 1 PR for 5.1 only (#11386).
