Decouple major slice from minor GCs (#11750)
My understanding was that the domains did major GC work whilst blocked waiting for other domains during minor GC. This was mentioned to me as one reason why the parallel minor collections don't perform worse than the previous concurrent minor collections. If that is the case then I would expect it to be worth keeping that behaviour. The idea is that a domain would do its normal major slice at the half-way point as this PR proposes, and do extra major GC work whenever it was blocked waiting on things. The "credit" mechanism (present in the old GC; not sure if it's implemented yet in the OCaml 5 GC) then automatically shortens the next normal slice to account for the extra work done.
Great. Out of interest, is there a credit mechanism like in the old GC, so that the opportunistic work leads to doing less work at the next slice (or slices; IIRC the old GC could spread the credit across multiple slices as well)?
There is `ocaml/runtime/caml/domain_state.tbl`, lines 74 to 78 (at a4a05dd). Currently, the credit is not spread across multiple slices. This is to be done.
I reproduced the test failure and it's a little weird (getting an
About the credit mechanism, I'm working on a solution to #11589 that will subsume it. |
Thanks! That's good to know. |
sadiqj left a comment:
To summarise and make sure I've understood this PR correctly:
- We now set a `young_trigger` which is halfway through the minor heap (logic borrowed from 4.14).
- The first domain to hit the `young_trigger`:
  - sets the major slice epoch to the current number of minor collections;
  - interrupts all other domains.
- Each domain does a major slice because its domain-local major slice epoch is now less than the global one.
I think the logic for doing so works and the implementation matches the above.
It would be nice to know why the GC pacing changes significantly for a few of the sandmark benchmarks though.
Thanks for the review. I'll do an analysis of
```c
void caml_major_collection_slice(intnat howmuch)
{
  caml_domain_state *d = Caml_state;
  uintnat major_slice_epoch = atomic_load (&caml_major_slice_epoch);
```
Is there a reason for the SC ordering on this variable here and elsewhere and not a weaker one? (I am asking from the point of view that using the weakest ordering makes the code easier to understand since relaxed and acq/rel ordering are often self-documenting; when I see the use of SC ordering I need some lengthy comment explaining what is being synchronised and how.)
For code that is not performance sensitive, my approach has been to pick the ordering that doesn't require complicated reasoning, hence the choice of SC atomics here. When I see acquire/release atomics, my takeaway would be that "someone has tried to optimise this, so this must be performance sensitive". What approach do we want to take in the runtime code, @xavierleroy?
FWIW release/acquire ordering should be sufficient here.
I'm in agreement with @kayceesrk here: when in doubt (and it isn't somewhere very hot), SC is fine. We should err towards being too strong rather than too weak.
I think the issue is that it is not obvious to me what ordering you need to reason that the code is correct. (For instance, is it innocuous to reload `caml_minor_collections_count` inside `advance_global_major_slice_epoch`, and why?)
I do not think one should "err" one way or another, I think it is better if the reason why it is correct is laid out for the reader without them having to guess. I simply point out that in the event where you only need acq-rel or relaxed ordering, it is often the case that using the most precise ordering makes the code (more) self-documenting. (If looking for a general principle, reasoning about correctness takes precedence over optimizations; if some code has in addition been noticed to be performance-sensitive then I think it makes sense to write it in a comment.) Note that I do not know if we are in this situation here given the interaction with the atomic_exchange, an alternative could be to add comments.
Keep in mind that this is in a context where atomic variables accumulate in the multicore runtime, it has been hard to read in other places, and there has been synchronisation issues in the past for things that looked simple.
> If looking for a general principle, reasoning about correctness takes precedence over optimizations

Precisely why I chose SC here.
I see the merits of @gadmm's position, but I would still side with @kayceesrk on this point.
For some reason I never got the e-mail notification of your reply @kayceesrk, sorry for the delay.
> precisely why I chose SC here
Earlier you said that the principle applied here was an acquire-release pattern, now you say that using SC ordering helps with the reasoning. Which one of the two is correct? (From how I learnt to reason about parallel code, there is a contradiction here---but perhaps I am just missing something about what you mean which is obvious to you, that I still do not see.)
What I see is a function that reloads an atomic variable for no apparent reason, which is a yellow flag in parallel code to me. It is possible that the reloading is harmless, for simple reasons even, but I am not going to spend time to try to convince myself that it is so, because when I asked to clarify the code somehow, you declined.
In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.
> Earlier you said that the principle applied here was an acquire-release pattern,
I don't think I mentioned that I am using acquire/release reasoning here. I mention in https://github.com/ocaml/ocaml/pull/11750/files#r1046733634 that acquire/release is sufficient, but not that I am using acquire/release reasoning. I must be missing something.
> In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.
Thanks.
Now that #11806 is merged, you should be good for a rebase and hopefully a green CI run @kayceesrk
Here are some numbers from this PR and rc1.

**RC1**

```
$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547647
minor_words: 925546573
promoted_words: 69291317
major_words: 69292391
minor_collections: 4417
major_collections: 885
forced_major_collections: 0
heap_words: 271348
top_heap_words: 304116
mean_space_overhead: 138.830102
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.82
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8708
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2605
Voluntary context switches: 1
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

**This PR**

```
$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547728
minor_words: 925546654
promoted_words: 69291383
major_words: 69292457
minor_collections: 4253
major_collections: 721
forced_major_collections: 0
heap_words: 316404
top_heap_words: 332788
mean_space_overhead: 126.859242
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.70
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1464
Voluntary context switches: 1
Involuntary context switches: 6
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

**RC1 with space overhead of 140**

FWIW I ran the RC1 version with a space overhead of 140, and the

```
$ OCAMLRUNPARAM="v=0x400,o=140" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547648
minor_words: 925546574
promoted_words: 69291317
major_words: 69292391
minor_collections: 4242
major_collections: 710
forced_major_collections: 0
heap_words: 308212
top_heap_words: 332788
mean_space_overhead: 129.885954
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.82
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8904
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2755
Voluntary context switches: 1
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

I conjecture that what we're observing is a difference due to when the budgets are computed for the major slices: at the end of the minor GC (trunk) or in the middle of it (this PR). In the worst case, the effects of this change are limited to a time proportional to half of the minor heap size. @sadiqj, not sure if you want me to dig into this deeper.
There is another failure in the CI. I'm going to have a look at this tomorrow.
Fixes a bug where young_trigger may not be updated to young_start at the end of the major slice (specifically, when the slice was interrupted).
Looks like the last CI failure is gone now.
No. Looks good. I had a look at the pacing and I think the biggest impact is probably that the major heap is so small; I suspect that is why we see a large swing on the lwt benchmark.
@damiendoligez Wondered if you were keen to review this PR?
Are there benchmarks showing latency improvements with this patch? The original PR (#297) doesn't seem to have any either. A Jane Street blog suggests it might be about having finer control when manually scheduling collections. |
The short answer is no, due to instrumentation & tooling not doing the right thing. See tarides/runtime_events_tools#7. With this PR, the
Reviewing now.
damiendoligez left a comment:
Looks good but I have a question.
```c
CAMLextern atomic_uintnat caml_minor_collections_count;

/* The epoch number for major slice. Used to trigger major slices.
   Always, [caml_major_slice_epoch <= caml_minor_collections_count] */
```
Is this invariant important? I think solving #11589 will involve decoupling caml_major_slice_epoch from caml_minor_collections_count and I'm wondering what would be broken by that.
`caml_major_slice_epoch` follows `caml_minor_collections_count` and is updated to `caml_minor_collections_count` when the first domain uses up half of its minor heap arena. Domains compare their own (domain-local) major slice epochs (`Caml_state->major_slice_epoch`) with this global `caml_major_slice_epoch` to decide whether to trigger a major slice.
I don't know the details of the solution to #11589. But as long as we can identify when a domain has used half of its minor heap arena, we can trigger major slices. For example, one may potentially replace `caml_major_slice_epoch` with a boolean flag that tracks whether some domain has used up half of its minor heap arena. A boolean flag would be needed per domain as well, to track whether that domain has performed the major slice for the current cycle. Such a boolean would be reset at the stop-the-world phase of the minor collection.
With a boolean flag, some other logic would be needed to ensure that those domains which didn't get a chance to run a major slice during a minor cycle trigger a slice as soon as the minor cycle is done. But I suspect that the solution to #11589 would change this part of the scheduling logic anyway.
In summary, we can work around any change that the solution to #11589 would bring in. The absolute counts aren't critical.
I pushed a Changes entry. The PR has 2 approvals and I believe it is ready to merge.
In OCaml 5.1.0, the number of minor collections is an atomic global state variable (because domains need to synchronize on this count) (see ocaml/ocaml#11750). Signed-off-by: Florian Angeletti <florian.angeletti@inria.fr>

OCaml 4 triggers a major slice when the minor heap is half full. This was shown to improve the responsiveness of programs. Multicore OCaml did not implement this functionality. This PR restores this functionality and extends it to parallel programs. In a parallel program, when one of the domains uses up half of its minor heap arena, it signals other domains to perform a slice of the major GC work.
Improvement
Sequential
A slice of the GC trace (obtained through `olly`) of the sequential binary trees benchmark compiled with trunk is shown below. Observe that the major slice immediately follows the minor collection.
The trace with this PR is shown below:
Observe that the major slices are now separated from the minor collection. They're triggered when the minor heap is half-full.
Parallel
Here is the slice of the trace from the parallel version of the binary trees benchmark. The trace with trunk (shown below) again shows that the major slice immediately follows the minor collection in the same stop-the-world section. The major slices of the domains other than the first one are short and hence hard to see on the trace; I've highlighted them with a red box.
With this PR, the major slices are spaced out between the stop-the-world minor collections:
Performance impact
As a sanity check, I measured the performance impact of this change on sequential programs in Sandmark. Here are the normalised results compared to the `trunk` baseline.

Time:

Instructions retired:

Some swing in the wall-clock time is expected. Instructions retired helps place the wall-clock time swings in context. From the normalised instructions-retired graph, one can see that most programs execute a similar number of instructions. The odd ones out are the `lwt` programs towards the right. Two of these can be explained by the increase in the number of major GCs.

Major GCs:

You can see that the two `lwt` programs end up doing more major GCs than trunk. This may explain the additional instructions executed.

Since we are changing when the major slices are scheduled, some change in the GC speed is expected. I plan to spend a little bit of time understanding why this is more significant in the `lwt` programs than in the others. That said, I don't expect this to change the PR much. I consider the PR ready for review.
lwtprograms than the other ones. That said, I don't expect this to change the PR much. I consider the PR to be ready for review.Sandmark nightly is currently down due to a broken dependency. A fix is being actively worked on. I am expecting parallelism results to be available early next week.