
Simplifications and fixes to multicore systhreads implementation (2/3) #11385

Merged
abbysmal merged 4 commits into ocaml:trunk from gadmm:systhread_simpl_and_fixes2
Oct 4, 2022

Conversation

@gadmm
Contributor

@gadmm gadmm commented Jul 1, 2022

Follow-up to #11271 (cc @Engil, @kayceesrk).

Targets 5.0 since this appears to fix domain shutdown. Details in the individual commit logs.

No change entry needed.

@gadmm
Contributor Author

gadmm commented Jul 3, 2022

I have rebased this PR, which now lives on top of #11390; thus only the two topmost commits belong to it.

@gasche
Member

gasche commented Jul 4, 2022

@gadmm This can now be rebased again.

@gasche
Member

gasche commented Jul 4, 2022

Precheck running at https://ci.inria.fr/ocaml/job/precheck/730/.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes2 branch from a4d624d to 102082b on July 4, 2022 12:08
@gadmm
Contributor Author

gadmm commented Jul 5, 2022

CI fails on ppc64 and s390x in bytecode, with timeouts on some tests related to this PR. (The OpenBSD failure seems unrelated.) I managed to compile on ppc64 via qemu, but I did not manage to reproduce the failure. (I did not manage to run a z/s390x emulator, though I tried.)

For lack of reproducibility I could not investigate the failure. Would anyone please help by reviewing this patch, in case I missed something?

@gadmm
Contributor Author

gadmm commented Jul 5, 2022

The following example shows what can go wrong:

let spawn_and_join_d f =
  let d = Domain.spawn f in
  Printf.printf "domain %d!\n" (Domain.get_id d :> int);
  Domain.join d

let spawn_and_join_t f =
  let t = Thread.create f () in
  Printf.printf "thread %d!\n" (Thread.id t);
  Thread.join t

let spawn_and_join f = spawn_and_join_d (fun () -> spawn_and_join_t f)

let _ =
  for i = 1 to 200 do spawn_and_join (fun () -> ()) done;
  spawn_and_join (fun () -> while true do () done)

On trunk, if you run ps -eLf while the code above is running, you see that it has more than 200 threads concurrently. These are probably 200 tick threads, causing context switches every 250µs (instead of every 50ms) on domain 1. This PR fixes this behaviour.

If we do not manage to find the root cause of the failure on ppc64/s390x in time, for 5.0 we could instead consider a one-line fix: gadmm@16a3174. It does not fix the stopping of tick threads, but it prevents them from multiplying.

@gasche
Member

gasche commented Jul 5, 2022

Out of curiosity, I ran a precheck test on the current trunk: https://ci.inria.fr/ocaml/job/precheck/733/

Member

@gasche gasche left a comment

I don't know anything about this code and just had a look to see if I could help.

Some very naive questions. We seem to use two different domain-local-storage mechanisms at the same time, the st_tls_{get,set}(caml_thread_key) stuff and then Active_thread which looks up into a domain-local array.

  • Why do we need both?
  • Are they supposed to be always synchronized, or not?
  • Is this documented somewhere?

the thread currently executing */
Active_thread = st_tls_get(caml_thread_key);
/* Restore the runtime state from the curr_thread descriptor */
caml_thread_restore_runtime_state();
Member

I'm not fond of this API change because we lose the clear bracketing between save_runtime_state and restore_runtime_state; I think it makes the code harder to read.

Maybe another API could look like this:

  • here we only need to call restore_runtime_state() (no argument)
  • restore_runtime_state is in charge of updating Active_thread and then calling restore_runtime_state_from on Active_thread
  • set_active calls restore_runtime_state_from as well

Contributor Author

I agree, this could be clarified.

Contributor Author

Fixed, please have a look

Contributor

I'm still a bit ambivalent on this part, but I don't quite remember the previous iteration.
I think the current proposal is fine, however. Do you have any updates on your comments, @gasche?

Contributor Author

I don't quite remember the previous iteration.

In case the old one is no longer visible from github, I think the new version addresses @gasche's criticism and the result is better (without losing what I intended to do).

@gadmm
Contributor Author

gadmm commented Jul 6, 2022

Why do we need both?

One is thread-local, the other is domain-local. The GC needs to know which is the active thread (e.g. from the backup thread).

Are they supposed to be always synchronized, or not?

The question does not make much sense because there are several threads in a domain. At every safe point, if a thread has the lock and the thread info is valid (i.e. not being destroyed or created) then Active_thread holds the info of this thread.

Is this documented somewhere?

Some commits here and elsewhere try to use this_thread instead of Active_thread when the two coincide. This makes the relationship clearer: we no longer use Active_thread as a proxy for this_thread, and we touch Active_thread only when we need to change their relationship.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes2 branch from 102082b to 0f38d34 Compare July 6, 2022 13:29
@gadmm
Contributor Author

gadmm commented Jul 6, 2022

After investigating failures I came up with a few more fixes to the systhreads implementation at #11403, which should be merged first. I have just rebased on top of #11403. I do not know if this is enough to fix the timeouts on ppc/s390x; a precheck is welcome.
#11403 includes the quick fix I mentioned above, which makes this PR a bit less urgent (it can be delayed to 5.1 if the CI failure is not resolved). Indeed, the impact of having tick threads still running after domain shutdown should be minimal.

To clarify which PRs should be considered first, I am converting this one to a draft.

(To be extra-clear: only the 3 topmost commits belong to this PR, and we will wait for #11403 to be merged before further reviewing this one, though a round of Inria CI checks is welcome in the meantime.)

@gadmm gadmm marked this pull request as draft July 6, 2022 13:38
@gasche
Member

gasche commented Jul 10, 2022

@gasche
Member

gasche commented Jul 22, 2022

#11403 is now merged. Are there bits from here that you want to suggest for 5.0? (Do we need to figure out how to initialize the thread machinery from C first, to solve the Thread.self issue?)

@gadmm gadmm force-pushed the systhread_simpl_and_fixes2 branch from 0f38d34 to d673b25 on July 23, 2022 21:57
@gadmm
Contributor Author

gadmm commented Jul 23, 2022

I want to suggest that all the systhreads PRs should be reviewed and their issues fixed for 5.0, including the several FIXMEs I have included in PR 3/3, but it is up to you.

According to your precheck at https://ci.inria.fr/ocaml/job/precheck/742/, it is likely that this version still fails on ppc/s390x. Again, I am looking for a volunteer to take over the systhreads PRs as I lack the time to take care of them.

I have rebased it on top of the many changes and eventual non-changes to make it easier for you (which took longer than I hoped).

@gadmm gadmm marked this pull request as ready for review July 23, 2022 22:07
Contributor

@abbysmal abbysmal left a comment

I think this PR is looking fine. I had two comments to make; otherwise, the changes and renaming are sensible.
I would like a second opinion from @gasche, but I think once the comments are addressed we can merge this one.

pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
#endif

//FIXME: RM
Contributor

RM?

Contributor Author

This refers to a minor resource management issue fixed in the PR n°3/3. I have cleaned up this line now.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes2 branch from d673b25 to 64ca9d1 on August 2, 2022 17:53
@gadmm
Contributor Author

gadmm commented Aug 2, 2022

A small reminder to run precheck since I have no reason to think that the failure for ppc/s390x has been fixed.

@gasche
Member

gasche commented Aug 2, 2022

@gadmm
Contributor Author

gadmm commented Aug 2, 2022

The test still fails on ppc/zsystems in what appears to be a deadlock. The only commit that changes behaviour is the one fixing thread stop, so something is fishy. I will have to let someone else investigate this tick thread issue. Reminder that:

I leave this PR open in case the problem is fixed separately and the PR is still relevant afterwards.

@abbysmal
Contributor

@gadmm I believe the issue with s390x is gone following #11550; shall I move on to review and possibly merge this PR?

@gadmm
Contributor Author

gadmm commented Sep 22, 2022

@Engil Congrats and thanks a lot for the investigation. I suspected that this was a pre-existing bug and was worried that it would forever remain a mystery. I agree that this likely makes it possible to proceed with this PR; let's see if it rebases well.

gadmm and others added 4 commits September 22, 2022 17:36

  • When we get there, we know that it is never the last thread. This reveals that the tick thread is never shut down, which is fixed next.
  • We document an invariant and use a setter which enforces the invariant. This factors the code at all the call points (same behaviour).
@gadmm gadmm force-pushed the systhread_simpl_and_fixes2 branch from 64ca9d1 to 3ef4d3f on September 22, 2022 16:03
@gadmm gadmm requested a review from abbysmal September 22, 2022 16:40
@gadmm
Contributor Author

gadmm commented Sep 23, 2022

I think this is ready for 5.0.

@abbysmal
Contributor

I was off today; will review on Monday. Have a great weekend!

Contributor

@abbysmal abbysmal left a comment

As far as I am concerned, this PR is a nice improvement on a few awkward parts of systhreads.
I'd like one more round of feedback from @gasche on a specific comment he made last time: I'm personally fine with the current version, but I'd like to know if he thinks otherwise.


@Octachron Octachron added this to the 5.0 milestone Oct 3, 2022
@abbysmal abbysmal merged commit dd5e82d into ocaml:trunk Oct 4, 2022
@abbysmal
Contributor

abbysmal commented Oct 4, 2022

Cherry-picked 89940bf 83e385d 55074ef 3ef4d3f to 5.0

@gadmm
Contributor Author

gadmm commented Oct 4, 2022

Thanks

@Octachron
Member

I am curious: any reason to cherry-pick all the individual commits by hand rather than the PR merge commit? (Also I personally like to record the original commit with the -x flag)

@gasche
Member

gasche commented Oct 4, 2022

It's not obvious how to cherry-pick a merge commit using git, and different projects do it differently, but indeed our HACKING.adoc#cherry-pick recommends git cherry-pick -x -m 1 for this purpose. For your next cherry-pick :-)

@abbysmal
Contributor

Ha, very sorry about this, I did forget to check HACKING.adoc and got slightly frustrated with git.
Will do better next time!
