Simplifications and fixes to multicore systhreads implementation (3/3) by gadmm · Pull Request #11386 · ocaml/ocaml

gadmm · 2022-07-01T22:54:31Z

Follow-up to #11271, #11385 and #11393 (cc @Engil, @kayceesrk).

Targets 5.1 (does not seem to fix important bugs). Details in the individual commit logs. Can you please have a look at the commit that removes dead code, in case this was unintended and hides a bug? I also added a FIXME regarding a behaviour from old systhreads that is not yet implemented in multicore (see comments below).

This PR lives on top of #11393, so only the 5 topmost commits belong to it.

No change entry needed.

gadmm · 2022-07-03T21:23:09Z

I have rebased on top of #11393, so keep in mind that only the 5 topmost commits belong to this PR.

gadmm · 2022-07-03T21:24:09Z

otherlibs/systhreads/st_stubs.c


+/* Cleanup the thread machinery when the runtime is shut down. Joining the tick
+   thread take 25ms on average / 50ms in the worst case, so we don't do it on
+   program exit. (FIXME) */


I restored this comment from old systhreads. However, this behaviour does not seem implemented yet in multicore.

gadmm · 2022-07-06T17:26:11Z

I rebased on top of #11385 and added some new minor fixes and clarifications. It still only contains less important fixes targeting 5.1. For clarity I convert it as a draft, we should review it when earlier PRs have been merged.

I added some more FIXMEs that I think are worth looking into.

(of course a round of precheck is welcome)

gasche · 2022-07-10T14:55:31Z

In case anyone wonders (I just did), the current "PR path" appears to be:

already merged:
next to review:
- Simplifications and fixes to multicore systhreads implementation (2/3) #11385
further PRs:
- Simplifications and fixes to multicore systhreads implementation (3/3) #11386 (the present PR)

gadmm · 2023-01-27T20:59:54Z

This should be rebased and made ready for review. While adapting to the new way systhread is initialized on every thread, I stumbled into a rabbit hole regarding error handling inside Domain.spawn and found other bugs with signals. I am not yet sure how to go about it, I just wanted to post an update.

gadmm · 2023-03-06T15:49:18Z

I have rebased on top of trunk. This could do with a round of review. Instructions for reviewers:

This is best reviewed commit-per-commit, with the help of additional detail in each commit log.
If someone knows systhread well, please review that commit "Remove dead code" does not reveal a regression in functionality compared to OCaml 4 (code that became dead for wrong reasons).
A few FIXMEs are added, coming from my review of the OCaml 5 systhread implem. and are about possible regressions wrt. OCaml 4. I do not intend or expect to have to fix these FIXMEs in this PR.
- Some FIXME concern possible changes/regressions wrt. the "best effort" to support fork on a single domain.
- One FIXME is about the delay for joining the tick threads on shutdown (25ms on average). On OCaml 4 there was no wait. When reviewing the code, it appeared to me that a solution could be to interrupt the sleep system call in the tick thread using the self-pipe trick with pselect (edit: or simply using pthread_cond_timedwait).
One FIXME is added to avoid delaying this PR further: the FIXME in domain_thread_func is about proper error handling inside domain initialization. Domain initialization was moved from OCaml to C code at the end of 5.0 development, which has created new issues. Errors during domain initialization should be converted into exceptions; currently this probably crashes or deadlocks. As said in my previous comment, fixing this has revealed to be a bit of a rabbit hole, it is doable but is better done separately in order to make progress on this PR. However, reviewers are free to disagree and refuse the addition of a FIXME comment in the code (in which case we can discuss solutions here).

gasche · 2023-03-06T17:49:01Z

Do we have a volunteer to review this PR? I have assigned myself as "person in charge of making sure we don't forget about it completely", but I would be happy to let someone else do the review.

kayceesrk · 2023-05-27T03:39:41Z

I'm starting to review this PR, but I have a couple of difficulties in the way this PR and others like #11307 are organised which is making it harder for me to review the PR(s).

My main difficulty is that the PRs suggest many improvements that seem independent but are organised as several commits, each of which has to be read one after the other. The details of the improvements are in the commit messages. This has several downsides that make it harder to review such PRs.

On Github, the discussions on several unrelated improvements get mixed together in the same PR thread which makes it harder to follow along.
If some of the early improvements are OK, but the follow-ups need changes, the whole PR gets blocked. As a reviewer, I am not able to checkpoint my approval on individual commits. I'm forced to read the entire commit chain in order to see how these interact with each other.
IINM, this is also different from how OCaml compiler development operates today, which is a PR for every improvement that's independent.

Personally, I really find it hard to review PRs commit by commit.

May I recommend @gadmm to consider having separate PRs for each improvement so that it makes reviewing easier? Small improvements will go much faster with a PR per commit. If they are inter-related, please mention that it builds on top of another PR. Also, possibly consider marking the PRs that are targets in a dependence chain as drafts so that reviewer time is not wasted on reviewing PRs which may change based on whether the source PRs require changes. Reviewer time is precious and limited, and it would help if the PRs are organised to be reviewer friendly.

kayceesrk

The first commit looks good to me. I've left some comments for optional improvements.

kayceesrk · 2023-05-27T04:06:45Z

otherlibs/systhreads/st_pthreads.h

+ out_err2:
+  pthread_mutex_destroy(&m->lock);
+ out_err:
+  return rc;


No change necessary, but noting that since we don't know which operation resulted in the non-0 return value, all we can conclude is that 0 is success and non-0 is an error. In particular, one cannot usefully inspect the error number.

I think we can simplify this by thinking about which errors would be useful to ultimately signal to the mutator -- there can only really be Out_of_memory or another exception to signal that a thread cannot be created (to distinguish between memory exhaustion and other system limits like thread limits).

In this function, the only way the pthread_mutex_init or pthread_mutex_cond functions can fail is with a memory or resource error (EAGAIN or ENOMEM), since the functions aren't called in a way that could return EINVAL. So just returning a negative error code of -1 to indicate an OOM condition would be simpler than propagating the results of one of the mutex calls failing, I think.

It is true that you do not know which one failed, but e.g. in case of ENOMEM this does not really matter.

The docs also mention EBUSY and the function is indeed used to reinitialize a mutex after fork, so since what is done for fork+threads is not really standards-compliant, it is best to keep the explicit error message for debugging. After spending an hour trying to figure out the best course of action here, I am thinking we might be overthinking it.

Raising an exception à la sync_check_error (which correctly maps ENOMEM to Out_of_memory) is doable, after fixing the handling of errors during domain creation.

kayceesrk · 2023-05-27T04:08:45Z

otherlibs/systhreads/st_stubs.c

-  m->init = 0; /* force initialization */
-  st_masterlock_init(m);
+  m->init = 0; /* force reinitialization */
+  if (0 != st_masterlock_init(m))


Is 0 != st_masterlock_init(m) better than st_masterlock_init(m) != 0?

The house style in the runtime is mostly the latter, with only a few sprinklings of the (0 == ...) style that I can find in runtime/sys.c

Thanks, fixed. Comments about style are welcome, to help fix the inconsistency of the code base.

kayceesrk · 2023-05-27T04:10:09Z

otherlibs/systhreads/st_stubs.c

-  st_masterlock_init(m);
+  m->init = 0; /* force reinitialization */
+  if (0 != st_masterlock_init(m))
+    caml_fatal_error("Unix.fork: failed to reinitialize master lock");


I can see that the proposed change is better than what was earlier. Is there no better recovery here than a caml_fatal_error?

Is even caml_fatal_error safe here, given the earlier giant "only execute async-signal-safe operations" comment in the diff hunk above? If we've failed to do something as simple as allocate a few mutexes at this point in the fork, just exiting is probably the only option.

Indeed if it fails, there is not much that can be done at that point. In case there is something to improve to the "best effort" to support fork+threads, it is still better to attempt writing an error message. Note that when trying to support fork+threads, one is already deep in unchartered territory (with some hope that the global lock ensures enough consistency to let you ignore the "only async-signal-safe" requirement). Of course, the whole enterprise is doomed to failure if there are other C threads and one is interrupted in the middle of malloc, say.

kayceesrk · 2023-05-27T04:11:58Z

otherlibs/systhreads/st_stubs.c

 }

-static void caml_thread_domain_initialize_hook(void)
+/* Returns 0 if initialization failed */


This semantics is the opposite of the semantics of other thread operations like pthread_mutex_init and pthread_cond_init where 0 is success and non-zero is an error. Wondered whether there was a reason to do it this way?

Agreed with @kayceesrk. It's the opposite of the convention established in st_masterlock_init in this diff.

The reasoning was that the other ones are internal, while this one is exported in some sense (my view was probably influenced by hacking heavily the runtime hooks in the past).

The new version raises an exception instead. This is mostly useful when initializing threads on the first domain, which was already prepared to receive an exception.

When fixing domain_thread_func to handle errors, this will need to be changed to returning an encoded exception.

Since I feel that my comment was not very clear: I am backtracking a bit on returning an error code and I keep the exception-raising behaviour of caml_stat_alloc. I add a check to raise an exception on faulty mutex initialization. This lets me improve the behaviour when initializing threads on the first domain, without fixing or making worse the behaviour on initialization for other domains. This last fix will have to come in another PR (as mentioned already).

runtime/domain.c

otherlibs/systhreads/st_stubs.c

otherlibs/systhreads/thread.ml

otherlibs/systhreads/st_stubs.c

kayceesrk · 2023-05-27T06:56:29Z

otherlibs/systhreads/st_stubs.c


+/* Cleanup the thread machinery when the runtime is shut down. Joining the tick
+   thread take 25ms on average / 50ms in the worst case, so we don't do it on
+   program exit. (FIXME: not implemented in OCaml 5 yet) */


~~IIUC, the fix is to not join the systhread since it might be sleeping?~~ I noticed that the commit messages confirms this.

Note, as I mentioned in #11386 (comment), that not joining is more problematic for multicore, and I suggested the use of pthread_cond_timedwait with a condition variable used to request continuing or shutting down as a possible solution.

otherlibs/systhreads/st_stubs.c

kayceesrk · 2023-05-27T07:01:46Z

testsuite/tests/lib-unix/common/multicore_fork_domain_alone.ml

-  | _ -> print_endline "NOK"
-  end;
-  Domain.join d
+  let res = match Unix.fork () with


This is much better. Thanks!

kayceesrk · 2023-05-27T07:03:11Z

Except for this one comment on the second commit (142beb8#r1207685036), the
other changes look fine to me.

avsm · 2023-05-27T15:39:48Z

otherlibs/systhreads/st_stubs.c

  /* First initialise the systhread chain on this domain */
-  caml_thread_domain_initialize_hook();
+  if (!caml_thread_domain_initialize_hook())
+    caml_failwith("caml_thread_initialize: Thread initialization failed.");


It would be really nice to distinguish between an OOM (Out_of_memory) and a thread allocation fail (Failure) here, if possible.

Wish granted. Note that properly dealing with those exceptions for domains ≥ 1 will be more work, as mentioned.

otherlibs/unix/fork.c

avsm · 2023-05-27T15:45:07Z

otherlibs/unix/unix.mli

+   process, the pid of the child process for the parent process. It
+   fails if the OCaml process is multi-core (any domain has been
+   spawned). In addition, if any threads have been spawned then the
+   child process might be in a corrupt state.


This isn't entirely clear on how other threads have been spawned, given the invariant earlier. Do you mean system threads independently spawned of the OCaml runtime?

Yes, I think this sentence refers to pthreads that would have been created by C code or other foreign code in the current process.

Also systhreads, which is what I meant originally. C/FFI programmers already know not to mix fork & threads. I clarified the comment.

gasche · 2023-06-15T20:05:01Z

@gadmm it looks like the present PR has been reviewed three weeks ago, and would be mergeable in trunk after you rebase and take review comments into account. Are you available to do this in the next few weeks? If not, would you prefer if a maintainer took care of the rebase?

gadmm · 2023-06-16T16:31:49Z

As I understand it, this is not scheduled for 5.1 so there is no rush.

kayceesrk · 2023-06-27T08:36:45Z

As I understand it, this is not scheduled for 5.1 so there is no rush.

It would be useful to get this towards a merge soon as the code has already gone out of sync with trunk and needs to be rebased. Merging it faster would mean that fixing conflicting changes would be the responsibility of folks who propose newer PRs ;-)

gadmm · 2023-06-27T16:06:23Z

Thanks for the gentle pressure, I am well aware and this has been at the top of my pile. As you can guess my own time is constrained, and I also received many solicitations from the OCaml github concurrently. I have been prioritizing 5.1. (Even though the release schedule is kept private for some odd reason, I have some sense that 5.1 is the priority at this time.)

Still on the topic of what we could do better to make sure that aging PRs end up getting merged, would monthly reminders be received positively? Would they help?

gadmm · 2023-07-09T15:56:32Z

The "fixup" comments are for reviewing my fixes. They will be squashed later.

kayceesrk · 2023-07-10T11:56:13Z

Except for this one comment on the second commit (142beb8#r1207685036), the
other changes look fine to me.

Force push lost my comment ?! :-( I don't know what my comment was, but I am assuming that it was fixed. If @gadmm can confirm, I think the PR is good to go.

gadmm · 2023-07-17T17:54:48Z

The comment is still there right above (#11386 (comment)), along with my reply, but it looks like your pointer was invalidated. (I got the still-valid link by doing "copy link" directly from the PR discussion, which seems to give a different URL than yours.)

Let me know about the last point, and then I will clean-up the PR.

kayceesrk · 2023-07-18T04:00:27Z

Let me know about the last point, and then I will clean-up the PR.

LGTM. Please also add a changes entry. I'll aim to merge this right after that.

This revealed further issues with the new way domains are initialized, hence the addition of a FIXME.

In this code path, if Thread_lock(Dom_c_threads) is well-defined, then Active_thread is non-null since both variables are initialized at the same moment. As with the previous one, this should be looked at to check if this does not hide a more serious bug.

The tick thread no longer uses signals.

Take advantage of the fact that the domain_state is passed as an argument

…have been forgotten Add a FIXME for a feature that is missing from OCaml 5 (avoid joining on the tick thread at shutdown).

We change the order of spawning of the tick thread so that in case of (very improbable) failure, we do not end up with both an exception and a thread being spawned.

gadmm · 2023-07-18T16:43:22Z

Rebased & added a Changes entry (mostly to credit the work done by reviewers, as no change entry was needed for these minor changes & fixes IMO).

Edit: as already mentioned in a different discussion, the general advice on the organisation of this PR, which mentions various points that were already applied, and did not seem related to the delay in starting the review, ended up discussed in private.

gadmm force-pushed the systhread_simpl_and_fixes3 branch from c4228dc to 40e25fc Compare July 3, 2022 20:52

gadmm mentioned this pull request Jul 3, 2022

Simplifications and fixes to multicore systhreads implementation #11271

Merged

gadmm commented Jul 3, 2022

View reviewed changes

gadmm force-pushed the systhread_simpl_and_fixes3 branch 2 times, most recently from b2d65d7 to e5a8f1e Compare July 6, 2022 17:23

gadmm marked this pull request as draft July 6, 2022 17:26

gadmm mentioned this pull request Jul 10, 2022

Simplifications and fixes to multicore systhreads implementation (2.5/3) #11393

Closed

gadmm mentioned this pull request Aug 2, 2022

Simplifications and fixes to multicore systhreads implementation (2/3) #11385

Merged

gadmm force-pushed the systhread_simpl_and_fixes3 branch from e5a8f1e to 9b109bf Compare March 6, 2023 15:13

gadmm marked this pull request as ready for review March 6, 2023 15:50

gasche self-assigned this Mar 6, 2023

gadmm mentioned this pull request May 26, 2023

possible memory-leak on failed domain backup thread installation #12269

Closed

kayceesrk reviewed May 27, 2023

View reviewed changes

otherlibs/systhreads/st_stubs.c Show resolved Hide resolved

kayceesrk reviewed May 27, 2023

View reviewed changes

otherlibs/systhreads/thread.ml Show resolved Hide resolved

kayceesrk reviewed May 27, 2023

View reviewed changes

otherlibs/systhreads/st_stubs.c Show resolved Hide resolved

kayceesrk reviewed May 27, 2023

View reviewed changes

otherlibs/systhreads/st_stubs.c Show resolved Hide resolved

kayceesrk reviewed May 27, 2023

View reviewed changes

otherlibs/systhreads/st_stubs.c Outdated Show resolved Hide resolved

kayceesrk reviewed May 27, 2023

View reviewed changes

kayceesrk approved these changes May 27, 2023

View reviewed changes

avsm reviewed May 27, 2023

View reviewed changes

otherlibs/unix/fork.c Outdated Show resolved Hide resolved

avsm reviewed May 27, 2023

View reviewed changes

kayceesrk self-assigned this Jun 16, 2023

gadmm force-pushed the systhread_simpl_and_fixes3 branch from ef90b1c to ae46a4a Compare July 9, 2023 15:54

gadmm and others added 6 commits July 18, 2023 18:08

Deal with errors in st_masterlock_init (e.g. when forking)

a21739a

This revealed further issues with the new way domains are initialized, hence the addition of a FIXME.

Remove dead code

2271e56

In this code path, if Thread_lock(Dom_c_threads) is well-defined, then Active_thread is non-null since both variables are initialized at the same moment. As with the previous one, this should be looked at to check if this does not hide a more serious bug.

Remove signal handler for preemption

e5cc7e1

The tick thread no longer uses signals.

[minor] Access domain_state instead of Caml_state in scan function

14e0bbf

Take advantage of the fact that the domain_state is passed as an argument

[minor] Improve comments and restore some comments from OCaml 4 that …

9371ee0

…have been forgotten Add a FIXME for a feature that is missing from OCaml 5 (avoid joining on the tick thread at shutdown).

Do not create a thread if caml_thread_new ends with an exception

6c79289

We change the order of spawning of the tick thread so that in case of (very improbable) failure, we do not end up with both an exception and a thread being spawned.

gadmm force-pushed the systhread_simpl_and_fixes3 branch from ae46a4a to 96833f9 Compare July 18, 2023 16:29

gadmm and others added 3 commits July 18, 2023 18:38

[minor] Some documentation and FIXMEs for Unix.fork

f08df41

Do not depend on exact error string in testsuite

ac33671

Changes

2569640

gadmm force-pushed the systhread_simpl_and_fixes3 branch from 96833f9 to 2569640 Compare July 18, 2023 16:40

kayceesrk merged commit 90dae66 into ocaml:trunk Jul 18, 2023

nojb mentioned this pull request Jul 27, 2023

Document that Unix.fork cannot be used after Domain.spawn #12435

Closed

Conversation

gadmm commented Jul 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gadmm commented Jul 3, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gadmm commented Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gasche commented Jul 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gadmm commented Jan 27, 2023

Uh oh!

gadmm commented Mar 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gasche commented Mar 6, 2023

Uh oh!

kayceesrk commented May 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kayceesrk left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kayceesrk May 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kayceesrk commented May 27, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

gadmm commented Jul 1, 2022 •

edited

Loading

gadmm commented Jul 6, 2022 •

edited

Loading

gasche commented Jul 10, 2022 •

edited

Loading

gadmm commented Mar 6, 2023 •

edited

Loading

kayceesrk commented May 27, 2023 •

edited

Loading

kayceesrk left a comment •

edited

Loading

kayceesrk May 27, 2023 •

edited

Loading

gadmm commented Jul 18, 2023 •

edited

Loading