Skip to content

Implement quality treatment for asynchronous actions in multicore (2/N)#11190

Merged
sadiqj merged 10 commits intoocaml:trunkfrom
gadmm:multicore_async_actions_2
Jun 8, 2022
Merged

Implement quality treatment for asynchronous actions in multicore (2/N)#11190
sadiqj merged 10 commits intoocaml:trunkfrom
gadmm:multicore_async_actions_2

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented Apr 13, 2022

This PR is another part of #11057 and comes after #11095.

This restores the polling behaviour in C code of OCaml 4 by delaying finalisers until the next suitable safe point.

This also removes races whereby signals can be forgotten.

Reviewing is a bit more involved than going commit-per-commit, see #11095 (comment). In particular, some FIXME comments are added about existing bugs, and not fixed in this PR but later ones.

cc @sadiqj, @gasche. (I think @stedolan, @jhjourdan can also be interested.)

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 14, 2022

I'm a bit sad to see the new user interface for Alloc_small (the "block" parameter) go away to be replaced by the weird Setup_for_gc and Restore_after_gc from before. I find the "new" interface easier to read, morally we are passing a higher-order parameter by position (so almost like a real function), instead of passing two partial pieces of code by name. This has the cost of some added redundancy, but maybe those could be hidden by some simpler functions re-exporting the caml_alloc_small_dispatch with common parameters (nallocs = 1, alloc_lens = NULL). The more frustrating aspect is the need to pass wosize to both Alloc_small itself and the gc handler, I don't know how to avoid a painful-in-practice redundancy there.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 14, 2022

So am I. What are you suggesting?

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 14, 2022

Maybe the following could work:

  • Alloc_small declares an internal variable wosize (this would also protect against multiple evaluation of the macro parameter)
  • then we can do Alloc_small(result, 1, tag, { caml_gc_dispatch_simple(wosize, CAML_FROM_C) } and it works

atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
if (caml_incoming_interrupts_queued()
dom_internal * d = &all_domains[dom_st->id];
if (atomic_load_relaxed(&d->interruptor.interrupt_pending)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think caml_incoming_interrupts_queued was an acquire load, which is replaced by a relaxed load here. Don't we need an acquire load to properly synchronize with the STW leader?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need an acquire load. There is no subsequent read that it needs to synchronize with. The synchronisation we need is with the preceding read done by atomic_exchange.

achieves the proper synchronisation. */
atomic_exchange(&dom_st->young_limit, (uintnat)dom_st->young_start);
if (caml_incoming_interrupts_queued()
dom_internal * d = &all_domains[dom_st->id];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes seems to be paving the way to call caml_reset_young_limit at another domain, but I can't tell (neither in #11057) where that would happen. When would we need it? (Are you planning to handle cross-domain interrupts with this mechanism, by calling caml_resete_young_limit on the thread to interrupt?)

Copy link
Copy Markdown
Contributor Author

@gadmm gadmm Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"private communication" from @sadiqj.

I think this is a desired direction for the domains implementation, and the current style of the surrounding code. (I mean relying less on Caml_state but passing domain_state around.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this was me? I'm not sure I recognise that code.

Copy link
Copy Markdown
Contributor Author

@gadmm gadmm Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was actually a quote from @ctk21 which you forwarded to me. It was a generic remark about passing the domain state around rather than introducing calls to Caml_state (something about work stealing for STW sections, and also the cost of Caml_state on some platforms).

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 14, 2022

@sadiqj I have an occasional failure of the test weak-ephe-final/finaliser2.ml as described here: #11057 (comment). Is there anything already known about this issue?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 14, 2022

I have considered keeping the "higher-order" style, but my conclusion was that it is not adapted to the situation. The usage was more adapted in multicore but this usage has now disappeared. There is no relationship with the old usage, for which I think it would make everything needlessly complicated (and not really higher-order). In particular I am worried about the multiplicity of parameters to Alloc_small in OCaml 4 to pass around. If you look at multicore usages, the duplication and non-locality of delicate code was less of a problem.

Now the old style was perfectly fine for OCaml 4 and what motivated the new style is no longer there, so I am happy with this PR.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 14, 2022

I just learnt that higher-order macros are possible, so this might after all give a nice simplification: https://godbolt.org/z/Wa16erbr4. However I feel like this improvement is orthogonal to the current PR, since this is an improvement over both the OCaml 4 style and the multicore style, and I prefer to preserve whatever steam I have left for the rest of the review process.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Apr 15, 2022

@sadiqj I have an occasional failure of the test weak-ephe-final/finaliser2.ml as described here: #11057 (comment). Is there anything already known about this issue?

@fabbing hit this on the frame pointers PR too. I think @jonludlam was going to take a look.

Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay in taking a look at this @gadmm and thanks for your effort in actually getting it done. For anyone else looking at the diff, large parts of the early commits are essentially code from 4.x (including the Setup_for_gc and Restore_after_gc) and while they could be improved I don't think we're in any worse position code-wise than we were before only now we benefit from asynchronous exception safety. This also gets us a lot closer to statmemprof.

Only a couple of minor points that it would be good to get clarification on which were left as comments.

There's an increased cost in leave blocking section but checking for signals is pretty cheap with Xavier's single word check. There's also an extra TLS though, which might be something to watch on some platforms.

@gasche were you planning to do a review?

@gadmm gadmm force-pushed the multicore_async_actions_2 branch from 5dd2744 to e5ae7dd Compare May 19, 2022 13:41
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 19, 2022

I rebased & added Changes. Eventually decided for the higher-order macro.

@gasche
Copy link
Copy Markdown
Member

gasche commented May 19, 2022

I haven't looked at the rest, but I like the higher-order macro. It's much less scary than it sounds (the code does not look complex), and less verbose than reverting to the previous approach for most use-cases.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented May 25, 2022

I've taken a look and am happy with the macro changes. Could you resolve the conflict in domain.c (the eventring PR moves the EV_MINOR event out of caml_poll_gc_work so you can just remove it) and I can get this merged @gadmm ?

@gadmm gadmm force-pushed the multicore_async_actions_2 branch from e5ae7dd to 1fb559a Compare May 25, 2022 14:42
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 25, 2022

done.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented May 25, 2022

Well that's a curious testsuite failure. Why does this change the number of major collections and only on OS X?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 25, 2022

This PR includes a fix to gc_full_major_exn, but apart from that it does not ring any bell on my side, especially given the platform-specific behaviour.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 25, 2022

Could you please have a look? I am a bit busy right now and I am not familiar with this eventlog test, it might have to do with the test itself. If I follow correctly the problem is that there is an extra slice here and there at each major GC in a reproducible manner (12000 vs 8000 for 2000 full major GC running 3 major collections each = 0 to 1 extra slice per major collection). But I do not see why you check the number of slices in the first place, since it might not be a portable number or a number stable across changes to GC internals.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented May 25, 2022

I'll take a look. The test needs to be sufficiently low level because runtime events is fairly tied to the GC internals. I may have got that particular test wrong though.

@kayceesrk kayceesrk self-assigned this May 31, 2022
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jun 7, 2022

Is there any update on this? It would be nice to get to the part 3/N soon (where in fact N=4 but part 4/4 is not needed for 5.0). I strongly suspect that the failing test is not the fault of this PR, could it be merged and you look for the problem in parallel?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jun 7, 2022

Could you by chance rebase on the current trunk? This would restart the CI, which may help establishing that the OSX failure was flimsy.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jun 7, 2022

Sorry @gadmm and thanks for the poke, I got blocked on reproducing on OS X.

I want to understand why this test is only failing with this PR, I haven't seen it fail on any other setup yet.

@kayceesrk
Copy link
Copy Markdown
Contributor

I can recreate this test tests/lib-runtime-events/'test_caml.ml' failure on both Linux and macOS. Both fail similarly:

$ make one TEST=tests/lib-runtime-events/'test_caml.ml'
....
> -minors: 18000, majors: 8000
> +minors: 18000, majors: 12000

@sadiqj can you check whether the test fails on Linux.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jun 8, 2022

We do not seem to see this failure in the same way. I don't think it is "flimsy" in the sense that running it again will make the failure disappear. This PR could change minor details of GC (changes to gc_ctrl.c, changes to the scheduling of finalisers) that causes the numbers to change with this PR and not others, but this is irrelevant.

It seems to me that the result of the test could depend on irrelevant machine- and environment-specific differences. Whether there is one more or one fewer slice could depend on the size of the heap. And on my machine simply opening a toplevel in various directories and running Gc.stats () shows that the size of the heap can vary ~4% depending on the current working directory.

So here you get 3 slices per major instead of 2, but until proof of the contrary this does not point at a problem with my PR. Please disable the test and then I will rebase so that we can get CI green.

To write a reliable test you could try to compare numbers from ones obtained using GC timing hooks instead of using arbitrary numbers in the reference file.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jun 8, 2022

That test has been fixed on trunk now, if you rebase now @gadmm we should be good to go.

gadmm added 10 commits June 8, 2022 16:53
This commit does not change behaviour. It rephrases `Alloc_small` in a
way that is more flexible. We will need the additional flexibility
later on.

Prelude to removing the loop and generalizing `caml_reset_young_limit`
to other pending actions, which is necessary for delaying them.
We do this by reintroducing `caml_alloc_small_dispatch` from
pre-multicore trunk.

We can now update `young_limit` to its desired value without causing
infinite loops when delaying callbacks.

Prerequisite to delaying callbacks and reintroducing memprof.
This lets us reset [young_limit] _before_ running the asynchronous
callbacks, and avoids polling repeatedly for signals that are not for
us.
Fixes test `tests/weak-ephe-final/finaliser2.ml`.
@gadmm gadmm force-pushed the multicore_async_actions_2 branch from 1fb559a to 617567c Compare June 8, 2022 14:58
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jun 8, 2022

Thanks, done. Let's keep fingers crossed.

@sadiqj sadiqj merged commit cbcb507 into ocaml:trunk Jun 8, 2022
@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jun 8, 2022

Looks good. Thanks for the work and the prod to get it merged @gadmm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants