Simplifications and fixes to multicore systhreads implementation (2/3)#11385
Simplifications and fixes to multicore systhreads implementation (2/3)#11385abbysmal merged 4 commits intoocaml:trunkfrom
Conversation
3ee7847 to
a4d624d
Compare
|
I have rebased this PR which now lives on top of #11390. Thus only the two topmost commits belong to it. |
|
@gadmm This can now be rebased again. |
|
Precheck running at https://ci.inria.fr/ocaml/job/precheck/730/ . |
a4d624d to
102082b
Compare
|
CI fails on ppc64 and s390x with bytecode by timeout on some tests related to this PR. (The OpenBSD failure seems unrelated.) I managed to compile on ppc64 via qemu, but I did not manage to reproduce the failure. (I did not manage to run a z/s390x emulator, I tried.) For lack of reproducibility I could not investigate the failure. Would anyone please help by reviewing this patch, in case I missed something? |
|
The following example shows what can go wrong: let spawn_and_join_d f =
let d = Domain.spawn f in
Printf.printf "domain %n\n!" (Domain.get_id d :> int);
Domain.join d
let spawn_and_join_t f =
let t = Thread.create f () in
Printf.printf "thread %n\n!" (Thread.id t);
Thread.join t
let spawn_and_join f = spawn_and_join_d (fun () -> spawn_and_join_t f)
let _ =
for i = 1 to 200 do spawn_and_join (fun () -> ()) done;
spawn_and_join (fun () -> while true do () done)On trunk, if you run If we do not manage to find the root of the failure on ppc64/s390x in time, for 5.0 we could consider instead a one-line fix here: gadmm@16a3174. It does not fix the stopping of tick threads but it prevents them from multiplying. |
|
Out of curiosity, I ran a precheck test on the current trunk: https://ci.inria.fr/ocaml/job/precheck/733/ |
There was a problem hiding this comment.
I don't know anything about this code and just had a look to see if I could help.
Some very naive questions. We seem to use two different domain-local-storage mechanisms at the same time, the st_tls_{get,set}(caml_thread_key) stuff and then Active_thread which looks up into a domain-local array.
- Why do we need both?
- Are they supposed to be always synchronized, or not?
- Is this documented somewhere?
| the thread currently executing */ | ||
| Active_thread = st_tls_get(caml_thread_key); | ||
| /* Restore the runtime state from the curr_thread descriptor */ | ||
| caml_thread_restore_runtime_state(); |
There was a problem hiding this comment.
I'm not fond of this API change because we lose the clear bracketing between save_runtime_state and restore_runtime_state, I think it makes the code harder to read.
Maybe another API coudl like this:
- here we only need to call
restore_runtime_state()(no argument) restore_runtime_stateis in charge of updatingActive_threadand then callingrestore_runtime_state_fromon Active_threadset_activecallsrestore_runtime_state_fromas well
There was a problem hiding this comment.
I agree, this could be clarified.
There was a problem hiding this comment.
Fixed, please have a look
There was a problem hiding this comment.
I'm still a bit ambivalent on this part, but I don't quite remember the previous iteration.
I think the current proposal is fine however, do you have any update on your comments @gasche ?
There was a problem hiding this comment.
I don't quite remember the previous iteration.
In case the old one is no longer visible from github, I think the new version addresses @gasche's criticism and the result is better (without losing what I intended to do).
One is thread-local, the other is domain-local. The GC needs to know which is the active thread (e.g. from the backup thread).
The question does not make much sense because there are several threads in a domain. At every safe point, if a thread has the lock and the thread info is valid (i.e. not being destroyed or created) then Active_thread holds the info of this thread.
Some commits here and elsewhere try to use this_thread instead of Active_thread when the two coincide. This makes the relationship clearer because we no longer use Active_thread as a proxy for this_thread, only when we need to change their relationship. |
102082b to
0f38d34
Compare
|
After investigating failures I came up with a few more fixes to the systhreads implementation at #11403 which should be fixed first. I have just rebased on top of #11403. I do not know if this is enough to fix the timeouts on ppc/s390x, a precheck is welcome. To clarify which PRs should be considered first, I am converting this one to a draft. (To be extra-clear: only the 3 topmost commits belong to this PR, and we will wait for #11403 to be merged before further reviewing this one, though a round of Inria CI check is welcome in the meanwhile.) |
|
#11403 is now merged. Are there bits from here that you want to suggest for 5.0? (Do we need to figure out how to initialize the thread machinery from C first, to solve the |
0f38d34 to
d673b25
Compare
|
I want to suggest that all the systhreads PRs should be reviewed and issues fixed for 5.0, including the several FIXMEs I have included in the PR 3/3, but it is up to you. According to your precheck at https://ci.inria.fr/ocaml/job/precheck/742/, it is likely that this version still fails on ppc/s390x. Again, I am looking for a volunteer to take over the systhreads PRs as I lack the time to take care of them. I have rebased it on top of the many changes and eventual non-changes to make it easier for you (which took longer than I hoped). |
abbysmal
left a comment
There was a problem hiding this comment.
I think this PR is looking fine, I had two comments to make, otherwise I think the changes and renaming are sensible.
I would like a second opinion from @gasche but I think once the comments are addressed we can merge this one.
otherlibs/systhreads/st_stubs.c
Outdated
| pthread_sigmask(SIG_SETMASK, &old_mask, NULL); | ||
| #endif | ||
|
|
||
| //FIXME: RM |
There was a problem hiding this comment.
This refers to a minor resource management issue fixed in the PR n°3/3. I have cleaned up this line now.
d673b25 to
64ca9d1
Compare
|
A small reminder to run precheck since I have no reason to think that the failure for ppc/s390x has been fixed. |
|
The test still fails on ppc/zsystems in what appears to be a deadlock. The only commit that changes behaviour is the one fixing thread stop, so something is fishy. I will have to let someone else investigate this tick thread issue. Reminder that:
I leave this PR open in case the problem is fixed separately and the PR is still relevant afterwards. |
|
@Engil Congrats and thanks a lot for the investigation, I suspected that this was a pre-existing bug and was worried that it would forever remain a mystery. I agree that this likely makes it possible to proceed with this PR, let's see if it rebases well. |
When we get there, we know that it is never the last thread. This reveals that the tick thread is never shut down, which is fixed next.
We document an invariant and use a setter which enforces the invariant. This factors the code at all the call points (same behaviour).
64ca9d1 to
3ef4d3f
Compare
|
I think this is ready for 5.0. |
|
I was off today, will review on monday, have a great weekend! |
abbysmal
left a comment
There was a problem hiding this comment.
As far as I am concerned, I think this PR is a nice improvement on a few awkward parts of sys threads.
I'd like one more feedback from @gasche on a specific comment he made last time: I'm personally fine with the current version, but I'd like to know if he thinks otherwise.
| the thread currently executing */ | ||
| Active_thread = st_tls_get(caml_thread_key); | ||
| /* Restore the runtime state from the curr_thread descriptor */ | ||
| caml_thread_restore_runtime_state(); |
There was a problem hiding this comment.
I'm still a bit ambivalent on this part, but I don't quite remember the previous iteration.
I think the current proposal is fine however, do you have any update on your comments @gasche ?
|
Thanks |
|
I am curious: any reason to cherry-pick all the individual commits by hand rather than the PR merge commit? (Also I personally like to record the original commit with the |
|
I'ts not obvious how to cherry-pick a merge commit using git, different projects do it differently, but indeed our HACKING.adoc#cherry-pick recommends |
|
Ha, very sorry about this, I did forget to check HACKING.adoc and got slightly frustrated with git. |
Follow-up to #11271 (cc @Engil, @kayceesrk).
Targets 5.0 since this appears to fix domain shutdown. Details in the individual commit logs.
No change entry needed.