Implement quality treatment for asynchronous actions in multicore (2/N)#11190
sadiqj merged 10 commits into ocaml:trunk from
Conversation
I'm a bit sad to see the new user interface for Alloc_small (the "block" parameter) go away to be replaced by the weird

So am I. What are you suggesting?

Maybe the following could work:
```diff
  atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
- if (caml_incoming_interrupts_queued()
+ dom_internal * d = &all_domains[dom_st->id];
+ if (atomic_load_relaxed(&d->interruptor.interrupt_pending)
```
I think caml_incoming_interrupts_queued was an acquire load, which is replaced by a relaxed load here. Don't we need an acquire load to properly synchronize with the STW leader?
There was a problem hiding this comment.
We do not need an acquire load. There is no subsequent read that it needs to synchronize with. The synchronisation we need is with the preceding read done by atomic_exchange.
```c
   achieves the proper synchronisation. */
atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
if (caml_incoming_interrupts_queued()
dom_internal * d = &all_domains[dom_st->id];
```
This change seems to be paving the way to calling caml_reset_young_limit from another domain, but I can't tell (neither in #11057) where that would happen. When would we need it? (Are you planning to handle cross-domain interrupts with this mechanism, by calling caml_reset_young_limit on the thread to interrupt?)
"private communication" from @sadiqj.
I think this is a desired direction for the domains implementation, and it matches the current style of the surrounding code. (I mean relying less on Caml_state and passing domain_state around instead.)
Are you sure this was me? I'm not sure I recognise that code.
It was actually a quote from @ctk21 which you forwarded to me. It was a generic remark about passing the domain state around rather than introducing calls to Caml_state (something about work stealing for STW sections, and also the cost of Caml_state on some platforms).
@sadiqj I have an occasional failure of the test weak-ephe-final/finaliser2.ml as described here: #11057 (comment). Is there anything already known about this issue?

I have considered keeping the "higher-order" style, but my conclusion was that it is not adapted to the situation. The usage was more adapted in multicore, but this usage has now disappeared. There is no relationship with the old usage, for which I think it would make everything needlessly complicated (and not really higher-order). In particular I am worried about the multiplicity of parameters to Alloc_small in OCaml 4 to pass around. If you look at multicore usages, the duplication and non-locality of delicate code was less of a problem. So: the old style was perfectly fine for OCaml 4, and what motivated the new style is no longer there, so I am happy with this PR.

I just learnt that higher-order macros are possible, so this might after all give a nice simplification: https://godbolt.org/z/Wa16erbr4. However, I feel this improvement is orthogonal to the current PR, since it is an improvement over both the OCaml 4 style and the multicore style, and I prefer to preserve whatever steam I have left for the rest of the review process.
@fabbing hit this on the frame pointers PR too. I think @jonludlam was going to take a look. |
Force-pushed from 47c4e00 to 5dd2744
Apologies for the delay in taking a look at this @gadmm, and thanks for your effort in actually getting it done. For anyone else looking at the diff: large parts of the early commits are essentially code from 4.x (including the Setup_for_gc and Restore_after_gc), and while they could be improved, I don't think we're in any worse position code-wise than we were before; only now we benefit from asynchronous exception safety. This also gets us a lot closer to statmemprof.
Only a couple of minor points that it would be good to get clarification on; these were left as comments.
There's an increased cost in leave_blocking_section, but checking for signals is pretty cheap with Xavier's single-word check. There's also an extra TLS access, though, which might be something to watch on some platforms.
@gasche were you planning to do a review?
Force-pushed from 5dd2744 to e5ae7dd
I rebased & added Changes. Eventually decided for the higher-order macro.

I haven't looked at the rest, but I like the higher-order macro. It's much less scary than it sounds (the code does not look complex), and less verbose than reverting to the previous approach for most use-cases.

I've taken a look and am happy with the macro changes. Could you resolve the conflict in
Force-pushed from e5ae7dd to 1fb559a
done.

Well, that's a curious testsuite failure. Why does this change the number of major collections, and only on OS X?

This PR includes a fix to

Could you please have a look? I am a bit busy right now, and I am not familiar with this eventlog test; it might have to do with the test itself. If I follow correctly, the problem is that there is an extra slice here and there at each major GC, in a reproducible manner (12000 vs 8000 for 2000 full major GCs running 3 major collections each = 0 to 1 extra slice per major collection). But I do not see why you check the number of slices in the first place, since it might not be a portable number, nor a number stable across changes to GC internals.
I'll take a look. The test needs to be sufficiently low-level because runtime events is fairly tied to the GC internals. I may have got that particular test wrong, though.

Is there any update on this? It would be nice to get to part 3/N soon (where in fact N=4, but part 4/4 is not needed for 5.0). I strongly suspect that the failing test is not the fault of this PR; could it be merged while you look for the problem in parallel?

Could you by chance rebase on the current trunk? This would restart the CI, which may help establish that the OS X failure was flimsy.

Sorry @gadmm, and thanks for the poke; I got blocked on reproducing on OS X. I want to understand why this test is only failing with this PR; I haven't seen it fail on any other setup yet.
I can recreate this test failure:

```
$ make one TEST=tests/lib-runtime-events/'test_caml.ml'
....
> -minors: 18000, majors: 8000
> +minors: 18000, majors: 12000
```

@sadiqj can you check whether the test fails on Linux.
We do not seem to see this failure in the same way. I don't think it is "flimsy" in the sense that running it again will make the failure disappear. This PR could change minor details of the GC (changes to gc_ctrl.c, changes to the scheduling of finalisers) that cause the numbers to change with this PR and not others, but this is irrelevant. It seems to me that the result of the test could depend on irrelevant machine- and environment-specific differences. Whether there is one more or one fewer slice could depend on the size of the heap. And on my machine simply opening a toplevel in various directories and running

So here you get 3 slices per major instead of 2, but until proof of the contrary this does not point at a problem with my PR. Please disable the test, and then I will rebase so that we can get CI green. To write a reliable test, you could try to compare numbers against ones obtained using GC timing hooks, instead of using arbitrary numbers in the reference file.

That test has been fixed on trunk now; if you rebase now, @gadmm, we should be good to go.
This commit does not change behaviour. It rephrases `Alloc_small` in a way that is more flexible. We will need the additional flexibility later on. Prelude to removing the loop and generalizing `caml_reset_young_limit` to other pending actions, which is necessary for delaying them.
We do this by reintroducing `caml_alloc_small_dispatch` from pre-multicore trunk. We can now update `young_limit` to its desired value without causing infinite loops when delaying callbacks. Prerequisite to delaying callbacks and reintroducing memprof.
Useful later on
This lets us reset [young_limit] _before_ running the asynchronous callbacks, and avoids polling repeatedly for signals that are not for us.
Fixes test `tests/weak-ephe-final/finaliser2.ml`.
Force-pushed from 1fb559a to 617567c
Thanks, done. Let's keep fingers crossed.

Looks good. Thanks for the work and the prod to get it merged @gadmm
This PR is another part of #11057 and comes after #11095.
This restores the polling behaviour in C code of OCaml 4 by delaying finalisers until the next suitable safe point.
This also removes races whereby signals can be forgotten.
Reviewing is a bit more involved than going commit-by-commit; see #11095 (comment). In particular, some FIXME comments are added about existing bugs, which are not fixed in this PR but in later ones.
cc @sadiqj, @gasche. (I think @stedolan, @jhjourdan can also be interested.)