Decouple major slice from minor GCs#11750

Merged
gasche merged 3 commits into ocaml:trunk from kayceesrk:decouple_major_and_minor3
Jan 11, 2023

Conversation

@kayceesrk
Contributor

@kayceesrk kayceesrk commented Nov 25, 2022

OCaml 4 triggers a major slice when the minor heap is half full. This was shown to improve the responsiveness of programs. Multicore OCaml did not implement this functionality. This PR restores this functionality and extends it to parallel programs. In a parallel program, when one of the domains uses up half of its minor heap arena, it signals other domains to perform a slice of the major GC work.
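The scheduling described above can be sketched in C. This is an illustrative model, not the runtime's actual code: the variable names mirror the real ones (caml_major_slice_epoch, caml_minor_collections_count), but the types, helpers, and the interrupt path are simplified assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Global counters: the slice epoch trails the minor-collection count. */
static atomic_uint_fast64_t minor_collections_count = 0;
static atomic_uint_fast64_t major_slice_epoch = 0;

/* Per-domain state: each domain remembers the last epoch it sliced in. */
typedef struct { uint_fast64_t local_slice_epoch; } domain_t;

/* Called by the first domain whose minor heap arena becomes half full:
   publish the current minor-collection count as the slice epoch, then
   (in the real runtime) interrupt the other domains. */
static void request_major_slice(void)
{
  uint_fast64_t count = atomic_load(&minor_collections_count);
  atomic_store(&major_slice_epoch, count);
  /* ... interrupt other domains here ... */
}

/* Each domain, on handling the interrupt, performs a major slice iff
   the global epoch has moved past its domain-local one. */
static int should_do_slice(domain_t *d)
{
  uint_fast64_t global = atomic_load(&major_slice_epoch);
  if (d->local_slice_epoch < global) {
    d->local_slice_epoch = global;  /* one slice per epoch per domain */
    return 1;
  }
  return 0;
}
```

The epoch comparison ensures each domain does at most one triggered slice per minor cycle, even if several domains hit the halfway mark and signal concurrently.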

Improvement

Sequential

A slice of the GC trace (obtained through olly) of the sequential binary trees benchmark compiled with trunk is shown below:

image

Observe that the major slice immediately follows the minor collection.

The trace with this PR is shown below:

image

Observe that the major slices are now separated from the minor collection. They're triggered when the minor heap is half-full.

Parallel

Here is a slice of the trace from the parallel version of the binary trees benchmark. The trace with trunk (shown below) again shows that the major slice immediately follows the minor collection in the same stop-the-world section. The major slices of all domains except the first are short and hence hard to see on the trace; I've highlighted them with a red box.

With this PR, the major slices are spaced out between the stop-the-world minor collections:

Performance impact

As a sanity check, I measured the performance impact of this change on sequential programs in Sandmark. Here are the normalised results compared to the trunk baseline:

Time:
image

Instructions retired:
image

Some swing in the wall-clock time is expected. Instructions retired helps place the wall-clock time swings in context. From the normalised instructions-retired graph, one can see that most programs execute a similar number of instructions. The odd ones out are the lwt programs towards the right. Two of these can be explained by the increase in the number of major GCs.

Major GCs:
image
You can see that the two lwt programs end up doing more major GCs than trunk. This may explain the additional instructions executed.

Since we are changing when the major slices are scheduled, some change is expected in the GC speed. I plan to spend a little bit of time understanding why this is more significant in the lwt programs than the other ones. That said, I don't expect this to change the PR much. I consider the PR to be ready for review.

Sandmark nightly is currently down due to a broken dependency. A fix is being actively worked on. I am expecting parallelism results to be available early next week.

@kayceesrk
Contributor Author

The test tests/lib-runtime-events/test_caml_parallel.ml is flaky. @sadiqj, would you help make this test more robust?

@lpw25
Contributor

lpw25 commented Nov 25, 2022

My understanding was that the domains did major GC work whilst blocked waiting for other domains during minor GC. This was mentioned to me as one reason for why the parallel minor collections don't perform worse than the previous concurrent minor collections.

If that is the case then I would expect it to be worth keeping that behaviour. The idea is that it would do its normal major slice at the half-way point as this PR proposes, and do extra major GC work whenever it was blocked waiting on things. The "credit" mechanism -- present in the old GC, not sure if it's implemented yet in the 5 GC -- then automatically shortens the next normal slice to account for the extra work done.

@kayceesrk
Contributor Author

@lpw25 The opportunistic major slice mechanism (major slice work when a domain is idle during a stop-the-world parallel minor GC) is still in there and hasn't been changed.

I know @ctk21 worked on implementing the credit mechanism, but I'm not aware of the details. @sadiqj might know.

@lpw25
Contributor

lpw25 commented Nov 25, 2022

Great. Out of interest, is there a credit mechanism like in the old GC, so that the opportunistic work leads to doing less work at the next slice (or slices -- IIRC the old GC could spread the credit across multiple slices as well)?

@kayceesrk
Contributor Author

is there a credit mechanism like in the old GC so that the opportunistic work leads to doing less work at the next slice

There is:

DOMAIN_STATE(intnat, major_work_todo)
/* balance of work to do in this GC clock cycle (in words)
* positive: we need to do this amount of work to finish the slice
* negative: we have done more than we need and this is credit
*/

Currently, the credit is not spread across multiple slices. This is to be done.
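A minimal sketch of how such a credit balance can be consumed at the next slice (illustrative only; the helper names and the clamping policy are assumptions, not the runtime's code). It also shows the current limitation: credit in excess of one slice's budget is simply discarded rather than spread over later slices.

```c
#include <assert.h>

typedef long intnat;

/* Per-domain in the real runtime; a single global suffices here.
   Positive: work still owed this cycle.  Negative: credit. */
static intnat major_work_todo = 0;

/* Opportunistic work (e.g. done while idle during a stop-the-world
   minor collection) drives the balance down, possibly below zero. */
static void record_work_done(intnat words)
{
  major_work_todo -= words;
}

/* At the next scheduled slice, shrink the computed budget by any
   accumulated credit.  The credit is consumed in full here: it is
   not carried over to later slices. */
static intnat next_slice_budget(intnat computed_budget)
{
  intnat budget = computed_budget + major_work_todo; /* todo may be < 0 */
  major_work_todo = 0;
  return budget < 0 ? 0 : budget;
}
```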

@sadiqj sadiqj assigned sadiqj and unassigned sadiqj Nov 27, 2022
@sadiqj
Contributor

sadiqj commented Nov 27, 2022

The test tests/lib-runtime-events/test_caml_parallel.ml is flaky. @sadiqj, would you help make this test more robust?

I reproduced the test failure and it's a little weird (getting an EV_MAJOR end without the start). I'm investigating.

@kayceesrk
Contributor Author

Parallel scalability results on Sandmark show no difference when compared to trunk (as expected).

image

@damiendoligez
Member

About the credit mechanism, I'm working on a solution to #11589 that will subsume it.

@kayceesrk
Contributor Author

About the credit mechanism, I'm working on a solution to #11589 that will subsume it.

Thanks! That's good to know.

Contributor

@sadiqj sadiqj left a comment


To summarise and make sure I've understood this PR correctly:

  • We now set a young_trigger which is halfway through the minor heap (logic borrowed from 4.14)
  • The first domain to hit the young_trigger:
    • Sets the major slice epoch to the current number of minor collections
    • Interrupts all other domains
  • Each domain does a major slice because its domain-local major slice epoch is now less than the global one

I think the logic for doing so works and the implementation matches the above.

It would be nice to know why the GC pacing changes significantly for a few of the sandmark benchmarks though.

@kayceesrk
Contributor Author

Thanks for the review. I'll do an analysis of thread_ring_lwt_mvar benchmark and report back here.

void caml_major_collection_slice(intnat howmuch)
{
  caml_domain_state *d = Caml_state;
  uintnat major_slice_epoch = atomic_load (&caml_major_slice_epoch);
Contributor


Is there a reason for the SC ordering on this variable here and elsewhere and not a weaker one? (I am asking from the point of view that using the weakest ordering makes the code easier to understand since relaxed and acq/rel ordering are often self-documenting; when I see the use of SC ordering I need some lengthy comment explaining what is being synchronised and how.)

Contributor Author


For code that is not performance-sensitive, my approach has been to pick the ordering that doesn't require complicated reasoning; hence the choice of SC atomics here. When I see acquire/release atomics, my takeaway would be that "someone has tried to optimise this and so this must be performance-sensitive". What approach do we want to take in the runtime code, @xavierleroy?

FWIW release/acquire ordering should be sufficient here.
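For reference, the distinction being debated is between C11's plain atomic operations, which default to memory_order_seq_cst, and the _explicit variants with a weaker ordering. A minimal illustration (the variable name is hypothetical, not the PR's code):

```c
#include <stdatomic.h>

static atomic_ulong epoch;

/* Plain load: sequentially consistent by default, the stronger and
   simpler-to-reason-about option chosen in this PR. */
static unsigned long read_epoch_sc(void)
{
  return atomic_load(&epoch);
}

/* Acquire load: weaker, but self-documents that it pairs with a
   release store on the same variable. */
static unsigned long read_epoch_acq(void)
{
  return atomic_load_explicit(&epoch, memory_order_acquire);
}

/* Release store: makes prior writes visible to the acquire load. */
static void publish_epoch(unsigned long e)
{
  atomic_store_explicit(&epoch, e, memory_order_release);
}
```

Functionally both reads return the same value; the difference lies in which reorderings the compiler and hardware may perform around them, and in what the choice signals to the reader.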

Contributor


I'm in agreement with @kayceesrk here: when in doubt (and it isn't somewhere very hot), SC is fine. We should err on the side of being too strong rather than too weak.

Contributor


I think the issue is that it is not obvious to me what ordering you need to reason that the code is correct. (For instance, is it innocuous to reload caml_minor_collections_count inside advance_global_major_slice_epoch, and why?)

I do not think one should "err" one way or another, I think it is better if the reason why it is correct is laid out for the reader without them having to guess. I simply point out that in the event where you only need acq-rel or relaxed ordering, it is often the case that using the most precise ordering makes the code (more) self-documenting. (If looking for a general principle, reasoning about correctness takes precedence over optimizations; if some code has in addition been noticed to be performance-sensitive then I think it makes sense to write it in a comment.) Note that I do not know if we are in this situation here given the interaction with the atomic_exchange, an alternative could be to add comments.

Keep in mind that this is in a context where atomic variables accumulate in the multicore runtime, it has been hard to read in other places, and there have been synchronisation issues in the past for things that looked simple.

Contributor Author


If looking for a general principle, reasoning about correctness takes precedence over optimizations

precisely why I chose SC here

Member


I see the merits of @gadmm's position, but I would still side with @kayceesrk on this point.

Contributor


For some reason I never got the e-mail notification of your reply @kayceesrk, sorry for the delay.

precisely why I chose SC here

Earlier you said that the principle applied here was an acquire-release pattern, now you say that using SC ordering helps with the reasoning. Which one of the two is correct? (From how I learnt to reason about parallel code, there is a contradiction here---but perhaps I am just missing something about what you mean which is obvious to you, that I still do not see.)

What I see is a function that reloads an atomic variable for no apparent reason, which is a yellow flag in parallel code to me. It is possible that the reloading is harmless, for simple reasons even, but I am not going to spend time to try to convince myself that it is so, because when I asked to clarify the code somehow, you declined.

In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.

Contributor Author


Earlier you said that the principle applied here was an acquire-release pattern,

I don't think I mentioned that I am using acquire/release reasoning here. I mention in https://github.com/ocaml/ocaml/pull/11750/files#r1046733634 that acquire/release is sufficient, but not that I am using acquire/release reasoning. I must be missing something.

In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.

Thanks.

@sadiqj
Contributor

sadiqj commented Dec 12, 2022

Now that #11806 is merged, you should be good for a rebase and hopefully a green CI run @kayceesrk

@kayceesrk
Contributor Author

Here are some numbers from this PR and rc1 on thread_ring_lwt_mvar. There isn't much of a difference between this PR and RC1 except in major_collections and *heap_words.

RC1

$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547647
minor_words: 925546573
promoted_words: 69291317
major_words: 69292391
minor_collections: 4417
major_collections: 885
forced_major_collections: 0
heap_words: 271348
top_heap_words: 304116
mean_space_overhead: 138.830102
        Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
        User time (seconds): 2.82
        System time (seconds): 0.00
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 8708
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2605
        Voluntary context switches: 1
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

This PR

$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547728
minor_words: 925546654
promoted_words: 69291383
major_words: 69292457
minor_collections: 4253
major_collections: 721
forced_major_collections: 0
heap_words: 316404
top_heap_words: 332788
mean_space_overhead: 126.859242
        Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
        User time (seconds): 2.70
        System time (seconds): 0.00
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.71
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 8968
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1464
        Voluntary context switches: 1
        Involuntary context switches: 6
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

RC1 with space overhead of 140

FWIW I ran RC1 version with a space overhead of 140, and the top_heap_words matches the one seen in this PR. The major and minor collection numbers are also closer.

$ OCAMLRUNPARAM="v=0x400,o=140" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547648
minor_words: 925546574
promoted_words: 69291317
major_words: 69292391
minor_collections: 4242
major_collections: 710
forced_major_collections: 0
heap_words: 308212
top_heap_words: 332788
mean_space_overhead: 129.885954
        Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
        User time (seconds): 2.82
        System time (seconds): 0.00
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 8904
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2755
        Voluntary context switches: 1
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I conjecture that what we're observing is a difference due to when the budgets are computed for the major slices: at the end of the minor GC (trunk) or in the middle of it (this PR). In the worst case, the effects of this change are limited to the time proportional to half of the minor heap size.

@sadiqj not sure if you want me to dig into this deeper.

@kayceesrk kayceesrk force-pushed the decouple_major_and_minor3 branch from 3f92e26 to 1e6638b Compare December 13, 2022 07:07
@kayceesrk
Contributor Author

There is another failure in the CI. I’m going to have a look at this tomorrow.

Fixes a bug where young_trigger may not be updated to young_start at the
end of the major slice (specifically, when the slice was interrupted).
@kayceesrk
Contributor Author

Looks like the last CI failure is gone now.

@sadiqj
Contributor

sadiqj commented Dec 15, 2022

@sadiqj not sure if you want me to dig into this deeper.

No. Looks good. I had a look at the pacing and I think the biggest impact is probably that Caml_state->allocated_words would have reflected all the major-heap allocations from the minor collection just before the major slice in progress. Now there's potentially been half a minor heap of allocations that aren't reflected.

I suspect we see a large swing on the lwt benchmark because the major heap is so small.

@kayceesrk
Contributor Author

@damiendoligez Wondered if you were keen to review this PR?

@wikku
Contributor

wikku commented Dec 23, 2022

OCaml 4 triggers a major slice when the minor heap is half full. This was shown to improve the responsiveness of programs.

Are there benchmarks showing latency improvements with this patch? The original PR (#297) doesn't seem to have any either. A Jane Street blog suggests it might be about having finer control when manually scheduling collections.

@kayceesrk
Contributor Author

Are there benchmarks showing latency improvements with this patch?

The short answer is no, due to instrumentation & tooling not doing the right thing. See tarides/runtime_events_tools#7. With this PR, the olly-reported latency for the specific event mentioned in the issue will be the correct one.

@damiendoligez
Member

Reviewing now.

Member

@damiendoligez damiendoligez left a comment


Looks good but I have a question.

CAMLextern atomic_uintnat caml_minor_collections_count;

/* The epoch number for major slice. Used to trigger major slices.
Always, [caml_major_slice_epoch <= caml_minor_collections_count] */
Member


Is this invariant important? I think solving #11589 will involve decoupling caml_major_slice_epoch from caml_minor_collections_count and I'm wondering what would be broken by that.

Contributor Author


caml_major_slice_epoch follows caml_minor_collections_count and is updated to caml_minor_collections_count when the first domain uses up half of its minor heap arena. Domains compare their own (domain-local) major slice epochs (Caml_state->major_slice_epoch) with this global caml_major_slice_epoch to decide whether to trigger a major slice.

I don't know the details of the solution to #11589. But as long as we can identify when a domain has used half of its minor heap arena, we can trigger major slices. For example, one may potentially replace caml_major_slice_epoch with a boolean flag that tracks whether some domain has used up half of its minor heap arena. A boolean flag would be needed per domain as well to track whether that domain has performed the major slice for the current cycle. Such a boolean would be reset at the stop-the-world phase of the minor collection.

With a boolean flag, some other logic would be needed to ensure that those domains which didn't get a chance to run a major slice during a minor cycle trigger a slice as soon as the minor cycle is done. But I suspect that the solution to #11589 would change this part of the scheduling logic anyway.

In summary, we can work around any change that the solution to #11589 would bring in. The absolute counts aren't critical.
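The hypothetical boolean-flag variant sketched above could look roughly like this. This is illustrative C, not what the PR implements, and the names are made up for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Global flag: some domain has used half of its minor heap arena. */
static atomic_bool slice_requested;

/* Per-domain flag: this domain has done its slice for the cycle. */
typedef struct { bool slice_done; } domain_t;

static void request_slice(void)
{
  atomic_store(&slice_requested, true);
}

/* A domain slices iff a slice was requested this cycle and it has
   not yet done one. */
static bool should_do_slice(domain_t *d)
{
  if (atomic_load(&slice_requested) && !d->slice_done) {
    d->slice_done = true;
    return true;
  }
  return false;
}

/* Both flags are reset at the stop-the-world phase of the next
   minor collection. */
static void stw_reset(domain_t *domains, int n)
{
  atomic_store(&slice_requested, false);
  for (int i = 0; i < n; i++) domains[i].slice_done = false;
}
```

As the comment above notes, this variant would still need extra logic for domains that never got to slice before the reset, which is what the epoch counter handles for free.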

void caml_major_collection_slice(intnat howmuch)
{
  caml_domain_state *d = Caml_state;
  uintnat major_slice_epoch = atomic_load (&caml_major_slice_epoch);
Member


I see the merits of @gadmm's position, but I would still side with @kayceesrk on this point.

@kayceesrk
Contributor Author

I pushed a Changes entry. The PR has 2 approvals and I believe it is ready to merge.

@gasche gasche merged commit 34cf5aa into ocaml:trunk Jan 11, 2023
@kayceesrk kayceesrk mentioned this pull request Apr 12, 2023
Octachron added a commit to Octachron/core that referenced this pull request Jun 28, 2023
In OCaml 5.1.0, the number of minor collections is an atomic global state
variable (because domains need to synchronize on this count)
(see ocaml/ocaml#11750).

Signed-off-by: Florian Angeletti <florian.angeletti@inria.fr>