Skip to content

Simplifications and fixes to multicore systhreads implementation (2.5/3)#11393

Closed
gadmm wants to merge 4 commits intoocaml:trunkfrom
gadmm:systhread_simpl_and_fixes2.5
Closed

Simplifications and fixes to multicore systhreads implementation (2.5/3)#11393
gadmm wants to merge 4 commits intoocaml:trunkfrom
gadmm:systhread_simpl_and_fixes2.5

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented Jul 3, 2022

This PR lives on top of #11385 and it fixes the 3rd commit of #11271 (cc @Engil, @kayceesrk, @gasche). It targets 5.0 because it is a dependency of #11272.

Note that only the two topmost commits belongs to it. It was better on top of #11385 because that one already did some of the changes needed to fix the bug, and is itself needed for 5.0.

It was not possible to produce a meaningful history but for your convenience here is essentially the diff of the bugfix:

--- a/otherlibs/systhreads/st_stubs.c
+++ b/otherlibs/systhreads/st_stubs.c
@@ -351,10 +351,11 @@ static void caml_thread_domain_stop_hook(void) {
     /* another domain thread may be joining on this domain's descriptor */
     caml_threadstatus_terminate(Terminated(this_thread->descr));
 
-    caml_stat_free(this_thread);
-
     /* Stop tick thread */
     set_active(NULL);
+
+    caml_stat_free(this_thread);
+    this_thread = NULL;
   };
 }
 
@@ -390,8 +391,9 @@ CAMLprim value caml_thread_yield(value unit);
 
 void caml_thread_interrupt_hook(void)
 {
-  /* Do not attempt to yield from the backup thread */
-  if (caml_bt_is_self()) return;
+  /* Do not attempt to yield from the backup thread or during domain
+     shutdown. */
+  if (this_thread == NULL) return;
 
   uintnat is_on = 1;
   atomic_uintnat* req_external_interrupt =
@@ -457,11 +459,12 @@ static void caml_thread_stop(void)
 
   caml_threadstatus_terminate(Terminated(this_thread->descr));
 
-  /* The following also sets Active_thread to a sane value in case the
-     backup thread does a GC before the domain lock is acquired
-     again. */
-  caml_thread_remove_info(Active_thread);
-  st_masterlock_release(&Thread_main_lock);
+  st_masterlock *thread_lock = &Thread_main_lock;
+  /* Remove this_thread and free its data. This also sets
+     Active_thread to a sane value in case the backup thread does a GC
+     before the domain lock is acquired again. */
+  caml_thread_remove_info(this_thread);
+  st_masterlock_release(thread_lock);
 }

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 3, 2022

No change entry needed. It passes the test suite on Alpine on my machine, but how about a round of precheck?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 4, 2022

I would welcome an explanation for non-experts of what the issue was and how/why we missed it the first time.

gadmm added 4 commits July 4, 2022 14:06
When we get there, we know that it is never the last thread. This
reveals that the tick thread is never shut down, which is fixed next.
We document an invariant and use a setter which enforces the
invariant. This factors the code at all the call points (same
behaviour), except in one place where it fixes something: the tick
thread is correctly shut down when a domain stops.
1. Use a thread-local variable this_thread instead of explicit pthread
   TLS. This lets us avoid circonvolutions.

2. Prefer using this_thread over Active_thread (trivial).

3. Never access Caml_state while the domain lock is not held.

The main motivation is 3. (in preparation of a subsequent PR). But the
fact that this is done by simplifying the code is nice too.
@gadmm gadmm force-pushed the systhread_simpl_and_fixes2.5 branch from 6c8401c to 7826b07 Compare July 4, 2022 12:14
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 4, 2022

Precheck run: https://ci.inria.fr/ocaml/job/precheck/731/

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 10, 2022

As shown by @gasche's precheck, it turns out that there were several separate issues that did not show up in the Github CI (and would show up or not on the INRIA CI).

The failure fixed in this PR was a programming error in #11271 (the first one which I fixed, because it was the easiest to reproduce). Two other pre-existing issue in the systhread implementation might explain the ppc/s390x deadlocks (but this is not clear yet), and another bug is Windows-specific (hitting a corner-case of platform incompatibility). While looking for the bugs I also found other fixes/FIXMEs which I added to #11386, which I think are not related and not the most urgent for 5.0 (but if you have time it would still be nice to review all PRs up to #11386 for 5.0).

@gasche Since you are asking about the present PR, the diff above fixes a use-after-free bug in caml_thread_stop. It was accessing this_thread via a macro while it had been deleted via an alias. This situation had been correctly noticed and avoided in another part of the code but overlooked here. Then I reviewed the code to ensure it did not occur anywhere else. This convinced me however that the faulty commit was not doing "Simplifications" as its title said but made things more complicated. So eventually I am closing this PR. I started again and found a minimal change for #11272 which is a better fit (and passes precheck), and I moved other improvements to other PRs.

For the issues in general, a safe systems programming language would simply say here, I think, that the systhreads implementation is too complicated (i.e. struggling to deal with a doubly-linked list with multiple pointers into it which may or may not be aliased) and would force to implement things differently. It does not give a magic solution. I have proposed various simplifications in my various systhreads-related PRs, but I do not immediately see more drastic changes to the implementation to make it simpler in a way that correctness of systhreads more obviously follows from structure (nor did I look too hard because I think it is beyond the scope).

Running precheck would have shown the presence of multiple failures, but the PR was not an obvious candidate for running precheck according to the HACKING document (maybe this should be done more systematically for non-obvious runtime modifications?).

The systhreads implementation is also some new code, with few experts available for reviewing. The PRs take a long time to be reviewed due to this lack of experts (a task that should not fall on the shoulders of a single person). There is no obvious take-away (apart from the fact that it has been a good thing to have more people looking at the code able/willing to become experts). But this incites to be careful about other parts of the multicore runtime which are in a similar situation, where some outstanding issues and PRs for 5.0 are seeing likewise little involvement and progress.

@gadmm gadmm closed this Jul 10, 2022
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 11, 2022

Thanks for the detailed post-mortem, which I found interesting.

apart from the fact that it has been a good thing to have more people looking at the code able/willing to become experts

I definitely agree and I think this is in fact the main value of the current systhreads activity, beyond the welcome cleanups and fixes.

On that note: are we careful enough to document this new understanding of the code? It's excellent that the knowledge of the codebase was spread from "mostly @Engil" to "@Engil and @gadmm with bits of @kayceesrk and others", but were these comprehension efforts followed by work to build a body of documentation to make it easier for newcomers to get into the codebase? I'm not trying to suggest that the answer is "no" (in fact there were already nice comments in the code, and you wrote more in your PR), but I think it's a good question to keep in mind and a continuous process that can always improve. Would you consider sending a comments-only PR to reify some parts of your newfound understanding of the codebase, or do you think that no such PR is necessary because the code is already documented enough?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants