Simplifications and fixes to multicore systhreads implementation (2.5/3)#11393
Simplifications and fixes to multicore systhreads implementation (2.5/3)#11393gadmm wants to merge 4 commits intoocaml:trunkfrom
Conversation
|
No change entry needed. It passes the test suite on Alpine on my machine, but how about a round of precheck? |
|
I would welcome an explanation for non-experts of what the issue was and how/why we missed it the first time. |
When we get there, we know that it is never the last thread. This reveals that the tick thread is never shut down, which is fixed next.
We document an invariant and use a setter which enforces the invariant. This factors the code at all the call points (same behaviour), except in one place where it fixes something: the tick thread is correctly shut down when a domain stops.
1. Use a thread-local variable this_thread instead of explicit pthread TLS. This lets us avoid circonvolutions. 2. Prefer using this_thread over Active_thread (trivial). 3. Never access Caml_state while the domain lock is not held. The main motivation is 3. (in preparation of a subsequent PR). But the fact that this is done by simplifying the code is nice too.
6c8401c to
7826b07
Compare
|
Precheck run: https://ci.inria.fr/ocaml/job/precheck/731/ |
|
As shown by @gasche's precheck, it turns out that there were several separate issues that did not show up in the Github CI (and would show up or not on the INRIA CI). The failure fixed in this PR was a programming error in #11271 (the first one which I fixed, because it was the easiest to reproduce). Two other pre-existing issue in the systhread implementation might explain the ppc/s390x deadlocks (but this is not clear yet), and another bug is Windows-specific (hitting a corner-case of platform incompatibility). While looking for the bugs I also found other fixes/FIXMEs which I added to #11386, which I think are not related and not the most urgent for 5.0 (but if you have time it would still be nice to review all PRs up to #11386 for 5.0). @gasche Since you are asking about the present PR, the diff above fixes a use-after-free bug in For the issues in general, a safe systems programming language would simply say here, I think, that the systhreads implementation is too complicated (i.e. struggling to deal with a doubly-linked list with multiple pointers into it which may or may not be aliased) and would force to implement things differently. It does not give a magic solution. I have proposed various simplifications in my various systhreads-related PRs, but I do not immediately see more drastic changes to the implementation to make it simpler in a way that correctness of systhreads more obviously follows from structure (nor did I look too hard because I think it is beyond the scope). Running precheck would have shown the presence of multiple failures, but the PR was not an obvious candidate for running precheck according to the The systhreads implementation is also some new code, with few experts available for reviewing. The PRs take a long time to be reviewed due to this lack of experts (a task that should not fall on the shoulders of a single person). There is no obvious take-away (apart from the fact that it has been a good thing to have more people looking at the code able/willing to become experts). But this incites to be careful about other parts of the multicore runtime which are in a similar situation, where some outstanding issues and PRs for 5.0 are seeing likewise little involvement and progress. |
|
Thanks for the detailed post-mortem, which I found interesting.
I definitely agree and I think this is in fact the main value of the current systhreads activity, beyond the welcome cleanups and fixes. On that note: are we careful enough to document this new understanding of the code? It's excellent that the knowledge of the codebase was spread from "mostly @Engil" to "@Engil and @gadmm with bits of @kayceesrk and others", but were these comprehension efforts followed by work to build a body of documentation to make it easier for newcomers to get into the codebase? I'm not trying to suggest that the answer is "no" (in fact there were already nice comments in the code, and you wrote more in your PR), but I think it's a good question to keep in mind and a continuous process that can always improve. Would you consider sending a comments-only PR to reify some parts of your newfound understanding of the codebase, or do you think that no such PR is necessary because the code is already documented enough? |
This PR lives on top of #11385 and it fixes the 3rd commit of #11271 (cc @Engil, @kayceesrk, @gasche). It targets 5.0 because it is a dependency of #11272.
Note that only the two topmost commits belongs to it. It was better on top of #11385 because that one already did some of the changes needed to fix the bug, and is itself needed for 5.0.
It was not possible to produce a meaningful history but for your convenience here is essentially the diff of the bugfix: