This repository was archived by the owner on Jun 21, 2024. It is now read-only.

Implementing the threads library with Domains #342

Closed
abbysmal wants to merge 3 commits into ocaml-multicore:parallel_minor_gc from abbysmal:parallel_minor_gc_4_10+dthreads

Conversation

@abbysmal (Collaborator) commented May 22, 2020

As previously mentioned on #240, here is a rebase of @jhwoodyatt's implementation of the Thread library in terms of Domains.

Some bits of the rebase are still not perfect, but it is in good enough condition that the build completes correctly and the testsuite now runs.

@jhwoodyatt

Thank you again for your contribution to Multicore OCaml. We will do our best to review your contribution and integrate it into the OCaml Multicore main development branch as soon as possible.

As the rebasing work is now in mostly good condition, my next step is to assess the testsuite results.
I will also work on #241 in parallel, in order to simplify testing this contribution against the wider OCaml ecosystem.

Original message from #240:

This is a request for review and comment on a candidate implementation of the Threads library in terms of the new Domain and Atomic modules in the Multicore OCaml standard library. This change, combined with my other patch that makes Unix.system avoid Unix.fork, lets the stock Dune 1.9.2 package compile and run properly on Multicore OCaml. (Or at least, it did until something in the master branch related to Lazy values introduced a regression.)
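
To make the approach concrete, here is a minimal sketch, assuming an OCaml 5-style Domain API (Domain.spawn : (unit -> 'a) -> 'a Domain.t and Domain.join : 'a Domain.t -> 'a), of a Thread-like interface expressed in terms of Domain; the Dthread name is illustrative, not this PR's actual code:

    (* Illustrative only: a Thread.create/Thread.join-style facade
       implemented directly on top of Domain. *)
    module Dthread = struct
      type t = unit Domain.t

      (* Spawn a domain running [f arg], like Thread.create. *)
      let create (f : 'a -> unit) (arg : 'a) : t =
        Domain.spawn (fun () -> f arg)

      (* Wait for the domain to finish, like Thread.join. *)
      let join (t : t) : unit = Domain.join t
    end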

Some caveats are worth noting up front:

Firstly, I've made a half-hearted attempt at introducing this as a third library, without overwriting the existing vmthreads and systhreads implementations. I noticed this decision becomes visible when trying to compile the ounit external package, so I'm anticipating the need to revisit some of these interoperability concerns.

Secondly, I've tried to improve the assurance of correctness by using the unit tests, but unfortunately this pull request does not currently pass all of them reliably. Moreover, the only reason it can pass most of them at all is that I inserted a call to Gc.minor in the blocking-section entrance hook, which is obviously a suboptimal workaround (see the sketch below). Also, the swapchan unit test looks to my eye like it has a concurrency error of its own: it can only pass reliably if the Thread library is expected to guarantee something about context switches that I don't think it actually promises. In short, I think it only accidentally passes reliably in the mainline monocore distribution.
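
As a rough sketch of the workaround just described (enter_blocking_section_hook and original_hook are invented names, not this PR's code): perform a minor collection on entry to every blocking section, so the minor heap is empty while the thread blocks:

    (* Sketch of the suboptimal workaround: empty the minor heap before
       entering a blocking section.  original_hook stands in for whatever
       the runtime would otherwise run at this point. *)
    let enter_blocking_section_hook original_hook () =
      Gc.minor ();   (* the workaround mentioned above *)
      original_hook ()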

(This introductory message is updated from its previous content.)

@jhwoodyatt

Would you be so kind as to add me as a reviewer or an assignee (whatever is the best policy)?

@kayceesrk (Contributor)

Done @jhwoodyatt.

@abbysmal (Collaborator, Author)

Some updates regarding the work on this PR:
I completed the rebasing work; while still not perfect, the build process is now cleaner and more streamlined.

Following #241, and the subsequent work to implement Unix.system without fork (see #344, #346 and ocaml/ocaml#9573),
I am also temporarily maintaining a branch combining this and the aforementioned changes:
https://github.com/Engil/ocaml-multicore/tree/parallel_minor_gc_4_10%2Bdthreads%2Bunix_system_no_fork
(This is for the sake of providing a nicer test bed for this PR; the branch will disappear once #346 is merged.)

I have not yet investigated the test results in detail.

The CI will likely catch up, but for now my observations on the lib-threads testsuite are the following:

  • The signals.ml test case in lib-threads seems to still pass, but it is unclear whether it does so reliably.
  • The delayintr.ml test case does not pass. There is apparently a deadlock that the Linux scheduler sometimes catches on my machine and sometimes does not, resulting in an infinite loop.
  • The sockets.ml test case does not pass reliably, deadlocking from time to time.
  • The fileio.ml test case also fails, on the branch combining this change and the Unix.system change.

The next step for me is to investigate each of the aforementioned issues.
I will report any progress in due time.

@kayceesrk (Contributor)

Thanks for the update @Engil. Do you want one of us to start looking at these issues right now? (asking since the PR is still marked as a draft).

@abbysmal force-pushed the parallel_minor_gc_4_10+dthreads branch from aade55d to e315ffb on May 28, 2020 at 13:53
@abbysmal changed the title from "Parallel minor gc 4 10+dthreads" to "Implementing the threads library with Domains" on May 28, 2020
@abbysmal marked this pull request as ready for review on May 28, 2020 at 13:56
@abbysmal (Collaborator, Author) commented May 28, 2020

#346 is now merged.
I rebased this branch against it, so it should now be ready for review.
I will keep investigating the failing test cases, and will also provide a review of the implementation once progress is made on that front.

@jhwoodyatt Please let us know if you have any specific needs (I think you are now correctly set as a reviewer as well), or specific notes about your contribution. :)
You did mention on #240 that the implementation was not quite ready; do you know of any particular shortcomings to keep in mind while reviewing?

Have a great day,

Edit:
#240 being superseded by this PR, I closed the original.
The top message here was edited to include the original introductory message from the previous PR.

@abbysmal (Collaborator, Author) commented Jun 3, 2020

I spent the last few days investigating the failing test cases I mentioned earlier.

  • lib-threads/delayintr.ml
    This test apparently fails due to a bad interaction between Interlock, the blocking-section hooks set up by Thread, and signal handling.
    The result is that the domain deadlocks trying to acquire the Interlock.
    The flow I've witnessed is the following (a sketch reconstructing the broken invariant follows this list):
    • The thread calls wait_until and waits for completion.
    • A signal is later sent to the program, for which the test case installs a signal handler: https://github.com/ocaml-multicore/ocaml-multicore/blob/e315ffbf4338629f51697b34e1a38a7027175bd0/testsuite/tests/lib-threads/delayintr.ml
    • The first behavior mismatch is that the domain does not handle the signal directly; rather, the signal handler only executes once wait_until is over. This causes the test case to fail, as it expects the signal to be handled "early", i.e. before the end of the sleeping period induced by the delay call.
    • The second behavior mismatch, and the source of the deadlock, is that after the suspend critical section is over, the signal handler executes without the resume call ever having run, introducing an inconsistent state where the Gate's flag is still Ready.
    • The signal handler executes and calls exit.
    • The call to exit triggers a new blocking section (while flushing the file descriptors), and the enter_blocking_section hook is called. Such a hook is indeed set up by Thread:
      Callback.register "threads_enter_blocking_section" enter;
    • The hook in the enter_aux function specifically tries to leave the interlock, which triggers the following assertion:
      assert (g.flag != Ready);
      (Remember, the resume call never happened, so this invariant is broken at this point.)
    • The enter_blocking_section hook thus fails, and the blocking section is then executed, leading to the effective flush of the file descriptors.
    • Then the exit-blocking-section hook is called:
      if not (Interlock.obtain ()) then raise Domain.Sync.Retry;

      This leads to a deadlock: the domain tries to acquire a lock that was never released, because of the assertion failure raised earlier.
  • lib-systhreads/testpreempt.ml
    This case is a bit more straightforward. The test fails because only one message is shown (the one printed after the "computation" completes). The two previous messages ("Interactions") are not printed, because the assumption that calls to Sys.time cause preemption of the main computation does not hold in this context: the delayed threads are never scheduled again.
  • lib-threads/fileio.ml
    • This test does not fail reliably, but the failure can be reproduced with enough tries using rr -h.
    • The gist is that there is a possible deadlock where a consumer thread blocks on a read operation, waiting for a producer thread to write to the pipe.
    • However, there is a possible interleaving where the producer thread is waiting for a stop-the-world (STW) section to happen, which never does because the other thread is blocked trying to read from the channel, leaving the process stuck spinning.
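
The sketch referenced in the delayintr item above is a hypothetical reconstruction of the broken invariant, not the PR's actual code; the type, field, and function names are assumptions for illustration:

    (* Hypothetical reconstruction of the delayintr deadlock: resume is
       supposed to flip the gate's flag away from Ready before the
       enter-blocking-section hook runs.  If the signal handler reaches
       exit without resume ever running, the assertion fires, the lock
       is never released, and the exit hook then spins forever trying
       to obtain it. *)
    type flag = Ready | Resumed

    type gate = { mutable flag : flag; lock : Mutex.t }

    let resume (g : gate) =
      g.flag <- Resumed            (* normal path: runs before the hook *)

    let enter_blocking_section_hook (g : gate) =
      assert (g.flag != Ready);    (* fails when resume never ran *)
      Mutex.unlock g.lock          (* skipped on failure: lock stays held *)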

That covers the tests I could investigate properly under this PR.
I am still trying to reproduce the lib-threads/sockets.ml test failure under rr.

Regarding the current state of this work in the wider ecosystem, I think one major missing piece is proper signal handling.
Building dune >= 2.0 is currently not possible because Thread.sigmask is missing (see the example below).
I think having a working prototype of #334 is the likely next target to be able to run this PR in the wild, as well as solving a few issues with potential signal mishaps in this implementation.
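
For reference, the kind of stock-OCaml Thread.sigmask usage that currently cannot compile looks like the following; the worker/main split is an invented example, but the API is the one from OCaml's threads library:

    (* Thread.sigmask : Unix.sigprocmask_command -> int list -> int list
       changes the calling thread's signal mask.  Here a worker thread
       blocks SIGINT so the signal is delivered to the main thread only. *)
    let worker () =
      ignore (Thread.sigmask Unix.SIG_BLOCK [ Sys.sigint ]);
      (* ... work that should not observe SIGINT ... *)
      ()

    let () =
      let t = Thread.create worker () in
      Thread.join t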

I will try to reproduce the socket test failure and investigate the underlying problem.

@kayceesrk (Contributor) commented Jun 12, 2020

One thing that was brought to our attention recently is that there are reasonable industrial use cases where a systhread is spawned per file descriptor. Hence, it is not unreasonable to see 10k threads alive at the same time. This would no longer be possible with this PR since the maximum number of domains alive at the same time is 128. @avsm @jonludlam

@edwintorok (Contributor) commented Jun 12, 2020

One of those use cases might be the XAPI project. On an otherwise fairly idle system with 1 VM running, XAPI has 59 threads and xenopsd has 48; a busier system will have more.
We support up to 4096 VMs managed by a single xapi daemon, and (were it not for a hardcoded 1024-FD limit due to its use of select) you could envision running an API call for each of those VMs in parallel, at least on the externally facing part; internally it could do some throttling/queuing if needed.

OTOH, Dom0 only has 16 vCPUs currently, so if the limit is 128 "concurrently executing OCaml threads" then it wouldn't affect us for a long time (as long as we can actually create >128 threads; without multicore the limit is 1, so 128 would be an improvement).

Also, there are systems with a large number of cores, e.g. 448 logical CPUs, so it is not unreasonable to expect someone to want to use all of that in an application (and future systems may have more).
A hardcoded limit on the number of threads in a system designed to scale to multicore would be best avoided.

Where does the 128 limit for Domains come from? Is there an O(domains)/O(log(domains)) algorithm somewhere?

@kayceesrk (Contributor) commented Jun 16, 2020

@edwintorok we're going to implement systhreads as multiple pthreads that service a single domain (see the sketch below). The downside of having domains stand in for systhreads is that a domain is more than just a C stack; it has an associated minor heap, and all domains need to participate in the minor GC as well as the major GC cycling. Hence, after some internal discussion we're going to do #357. This does not have any restriction on the number of systhreads that can be spawned.
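
As a rough illustration of that model (a sketch under invented names, not the actual #357 design): all systhreads of a domain contend on a single master mutex, so at most one runs OCaml code at a time, and each releases the mutex around blocking operations:

    (* Sketch: many systhreads multiplexed on one domain.  A single
       master lock ensures only one thread runs OCaml code at a time;
       blocking sections release it so sibling threads can run. *)
    let master_lock = Mutex.create ()

    let enter_blocking_section () = Mutex.unlock master_lock
    let leave_blocking_section () = Mutex.lock master_lock

    (* Each systhread body runs while holding the master lock. *)
    let run_systhread (body : unit -> unit) =
      Mutex.lock master_lock;
      body ();
      Mutex.unlock master_lock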

Where does the 128 limit for Domains come from? Is there an O(domains)/O(log(domains)) algorithm somewhere?

128 comes from our minor heap layout for the concurrent minor collector [1]. The current parallel minor collector does not have this restriction, but still carries the vestiges of the concurrent collector (which will be removed later). Still, the fact that all domains have to synchronize for minor GC and major GC cycling stands. Hence, #357 is definitely a better idea.

[1] https://arxiv.org/abs/2004.11663

@jhwoodyatt

I think having a working prototype of #334 is the likely next target to be able to run this PR in the wild, as well as solving a few issues with potential signal mishaps in this implementation.

Thanks so much for looking into this for me! I think I agree. #334 is an important feature that will make a difference in how this change can progress.

@jhwoodyatt

Hence, after some internal discussion we're going to do #357. This does not have any restriction on the number of systhreads that can be spawned.

Okay. I have no objection to that.

@kayceesrk (Contributor)

Closed in favour of #407.
