Simplifications and fixes to multicore systhreads implementation (3/3)#11386
Simplifications and fixes to multicore systhreads implementation (3/3)#11386kayceesrk merged 9 commits intoocaml:trunkfrom
Conversation
c4228dc to
40e25fc
Compare
|
I have rebased on top of #11393, so keep in mind that only the 5 topmost commits belong to this PR. |
otherlibs/systhreads/st_stubs.c
Outdated
|
|
||
| /* Cleanup the thread machinery when the runtime is shut down. Joining the tick | ||
| thread take 25ms on average / 50ms in the worst case, so we don't do it on | ||
| program exit. (FIXME) */ |
There was a problem hiding this comment.
I restored this comment from old systhreads. However, this behaviour does not seem implemented yet in multicore.
b2d65d7 to
e5a8f1e
Compare
|
I rebased on top of #11385 and added some new minor fixes and clarifications. It still only contains less important fixes targeting 5.1. For clarity I convert it as a draft, we should review it when earlier PRs have been merged. I added some more FIXMEs that I think are worth looking into. (of course a round of precheck is welcome) |
|
In case anyone wonders (I just did), the current "PR path" appears to be:
|
|
This should be rebased and made ready for review. While adapting to the new way systhread is initialized on every thread, I stumbled into a rabbit hole regarding error handling inside |
e5a8f1e to
9b109bf
Compare
|
I have rebased on top of trunk. This could do with a round of review. Instructions for reviewers:
|
|
Do we have a volunteer to review this PR? I have assigned myself as "person in charge of making sure we don't forget about it completely", but I would be happy to let someone else do the review. |
|
I'm starting to review this PR, but I have a couple of difficulties in the way this PR and others like #11307 are organised which is making it harder for me to review the PR(s). My main difficulty is that the PRs suggest many improvements that seem independent but are organised as several commits, each of which has to be read one after the other. The details of the improvements are in the commit messages. This has several downsides that make it harder to review such PRs.
Personally, I really find it hard to review PRs commit by commit. May I recommend @gadmm to consider having separate PRs for each improvement so that it makes reviewing easier? Small improvements will go much faster with a PR per commit. If they are inter-related, please mention that it builds on top of another PR. Also, possibly consider marking the PRs that are targets in a dependence chain as drafts so that reviewer time is not wasted on reviewing PRs which may change based on whether the source PRs require changes. Reviewer time is precious and limited, and it would help if the PRs are organised to be reviewer friendly. |
| out_err2: | ||
| pthread_mutex_destroy(&m->lock); | ||
| out_err: | ||
| return rc; |
There was a problem hiding this comment.
No change necessary, but noting that since we don't know which operation resulted in the non-0 return value, all we can conclude is that 0 is success and non-0 is an error. In particular, one cannot usefully inspect the error number.
There was a problem hiding this comment.
I think we can simplify this by thinking about which errors would be useful to ultimately signal to the mutator -- there can only really be Out_of_memory or another exception to signal that a thread cannot be created (to distinguish between memory exhaustion and other system limits like thread limits).
In this function, the only way the pthread_mutex_init or pthread_mutex_cond functions can fail is with a memory or resource error (EAGAIN or ENOMEM), since the functions aren't called in a way that could return EINVAL. So just returning a negative error code of -1 to indicate an OOM condition would be simpler than propagating the results of one of the mutex calls failing, I think.
There was a problem hiding this comment.
It is true that you do not know which one failed, but e.g. in case of ENOMEM this does not really matter.
The docs also mention EBUSY and the function is indeed used to reinitialize a mutex after fork, so since what is done for fork+threads is not really standards-compliant, it is best to keep the explicit error message for debugging. After spending an hour trying to figure out the best course of action here, I am thinking we might be overthinking it.
Raising an exception à la sync_check_error (which correctly maps ENOMEM to Out_of_memory) is doable, after fixing the handling of errors during domain creation.
otherlibs/systhreads/st_stubs.c
Outdated
| m->init = 0; /* force initialization */ | ||
| st_masterlock_init(m); | ||
| m->init = 0; /* force reinitialization */ | ||
| if (0 != st_masterlock_init(m)) |
There was a problem hiding this comment.
Is 0 != st_masterlock_init(m) better than st_masterlock_init(m) != 0?
There was a problem hiding this comment.
The house style in the runtime is mostly the latter, with only a few sprinklings of the (0 == ...) style that I can find in runtime/sys.c
There was a problem hiding this comment.
Thanks, fixed. Comments about style are welcome, to help fix the inconsistency of the code base.
| st_masterlock_init(m); | ||
| m->init = 0; /* force reinitialization */ | ||
| if (0 != st_masterlock_init(m)) | ||
| caml_fatal_error("Unix.fork: failed to reinitialize master lock"); |
There was a problem hiding this comment.
I can see that the proposed change is better than what was earlier. Is there no better recovery here than a caml_fatal_error?
There was a problem hiding this comment.
Is even caml_fatal_error safe here, given the earlier giant "only execute async-signal-safe operations" comment in the diff hunk above? If we've failed to do something as simple as allocate a few mutexes at this point in the fork, just exiting is probably the only option.
There was a problem hiding this comment.
Indeed if it fails, there is not much that can be done at that point. In case there is something to improve to the "best effort" to support fork+threads, it is still better to attempt writing an error message. Note that when trying to support fork+threads, one is already deep in unchartered territory (with some hope that the global lock ensures enough consistency to let you ignore the "only async-signal-safe" requirement). Of course, the whole enterprise is doomed to failure if there are other C threads and one is interrupted in the middle of malloc, say.
otherlibs/systhreads/st_stubs.c
Outdated
| } | ||
|
|
||
| static void caml_thread_domain_initialize_hook(void) | ||
| /* Returns 0 if initialization failed */ |
There was a problem hiding this comment.
This semantics is the opposite of the semantics of other thread operations like pthread_mutex_init and pthread_cond_init where 0 is success and non-zero is an error. Wondered whether there was a reason to do it this way?
There was a problem hiding this comment.
Agreed with @kayceesrk. It's the opposite of the convention established in st_masterlock_init in this diff.
There was a problem hiding this comment.
The reasoning was that the other ones are internal, while this one is exported in some sense (my view was probably influenced by hacking heavily the runtime hooks in the past).
The new version raises an exception instead. This is mostly useful when initializing threads on the first domain, which was already prepared to receive an exception.
When fixing domain_thread_func to handle errors, this will need to be changed to returning an encoded exception.
There was a problem hiding this comment.
Since I feel that my comment was not very clear: I am backtracking a bit on returning an error code and I keep the exception-raising behaviour of caml_stat_alloc. I add a check to raise an exception on faulty mutex initialization. This lets me improve the behaviour when initializing threads on the first domain, without fixing or making worse the behaviour on initialization for other domains. This last fix will have to come in another PR (as mentioned already).
|
|
||
| /* Cleanup the thread machinery when the runtime is shut down. Joining the tick | ||
| thread take 25ms on average / 50ms in the worst case, so we don't do it on | ||
| program exit. (FIXME: not implemented in OCaml 5 yet) */ |
There was a problem hiding this comment.
IIUC, the fix is to not join the systhread since it might be sleeping? I noticed that the commit messages confirms this.
There was a problem hiding this comment.
Note, as I mentioned in #11386 (comment), that not joining is more problematic for multicore, and I suggested the use of pthread_cond_timedwait with a condition variable used to request continuing or shutting down as a possible solution.
| | _ -> print_endline "NOK" | ||
| end; | ||
| Domain.join d | ||
| let res = match Unix.fork () with |
There was a problem hiding this comment.
This is much better. Thanks!
|
Except for this one comment on the second commit (142beb8#r1207685036), the |
otherlibs/systhreads/st_stubs.c
Outdated
| /* First initialise the systhread chain on this domain */ | ||
| caml_thread_domain_initialize_hook(); | ||
| if (!caml_thread_domain_initialize_hook()) | ||
| caml_failwith("caml_thread_initialize: Thread initialization failed."); |
There was a problem hiding this comment.
It would be really nice to distinguish between an OOM (Out_of_memory) and a thread allocation fail (Failure) here, if possible.
There was a problem hiding this comment.
Wish granted. Note that properly dealing with those exceptions for domains ≥ 1 will be more work, as mentioned.
otherlibs/unix/unix.mli
Outdated
| process, the pid of the child process for the parent process. It | ||
| fails if the OCaml process is multi-core (any domain has been | ||
| spawned). In addition, if any threads have been spawned then the | ||
| child process might be in a corrupt state. |
There was a problem hiding this comment.
This isn't entirely clear on how other threads have been spawned, given the invariant earlier. Do you mean system threads independently spawned of the OCaml runtime?
There was a problem hiding this comment.
Yes, I think this sentence refers to pthreads that would have been created by C code or other foreign code in the current process.
There was a problem hiding this comment.
Also systhreads, which is what I meant originally. C/FFI programmers already know not to mix fork & threads. I clarified the comment.
|
@gadmm it looks like the present PR has been reviewed three weeks ago, and would be mergeable in trunk after you rebase and take review comments into account. Are you available to do this in the next few weeks? If not, would you prefer if a maintainer took care of the rebase? |
|
As I understand it, this is not scheduled for 5.1 so there is no rush. |
It would be useful to get this towards a merge soon as the code has already gone out of sync with trunk and needs to be rebased. Merging it faster would mean that fixing conflicting changes would be the responsibility of folks who propose newer PRs ;-) |
|
Thanks for the gentle pressure, I am well aware and this has been at the top of my pile. As you can guess my own time is constrained, and I also received many solicitations from the OCaml github concurrently. I have been prioritizing 5.1. (Even though the release schedule is kept private for some odd reason, I have some sense that 5.1 is the priority at this time.) Still on the topic of what we could do better to make sure that aging PRs end up getting merged, would monthly reminders be received positively? Would they help? |
ef90b1c to
ae46a4a
Compare
|
The "fixup" comments are for reviewing my fixes. They will be squashed later. |
Force push lost my comment ?! :-( I don't know what my comment was, but I am assuming that it was fixed. If @gadmm can confirm, I think the PR is good to go. |
|
The comment is still there right above (#11386 (comment)), along with my reply, but it looks like your pointer was invalidated. (I got the still-valid link by doing "copy link" directly from the PR discussion, which seems to give a different URL than yours.) Let me know about the last point, and then I will clean-up the PR. |
LGTM. Please also add a changes entry. I'll aim to merge this right after that. |
This revealed further issues with the new way domains are initialized, hence the addition of a FIXME.
In this code path, if Thread_lock(Dom_c_threads) is well-defined, then Active_thread is non-null since both variables are initialized at the same moment. As with the previous one, this should be looked at to check if this does not hide a more serious bug.
The tick thread no longer uses signals.
Take advantage of the fact that the domain_state is passed as an argument
…have been forgotten Add a FIXME for a feature that is missing from OCaml 5 (avoid joining on the tick thread at shutdown).
We change the order of spawning of the tick thread so that in case of (very improbable) failure, we do not end up with both an exception and a thread being spawned.
ae46a4a to
96833f9
Compare
96833f9 to
2569640
Compare
|
Rebased & added a Changes entry (mostly to credit the work done by reviewers, as no change entry was needed for these minor changes & fixes IMO). Edit: as already mentioned in a different discussion, the general advice on the organisation of this PR, which mentions various points that were already applied, and did not seem related to the delay in starting the review, ended up discussed in private. |
Follow-up to #11271, #11385 and #11393 (cc @Engil, @kayceesrk).
Targets 5.1 (does not seem to fix important bugs). Details in the individual commit logs. Can you please have a look at the commit that removes dead code, in case this was unintended and hides a bug? I also added a FIXME regarding a behaviour from old systhreads that is not yet implemented in multicore (see comments below).
This PR lives on top of #11393, so only the 5 topmost commits belong to it.
No change entry needed.