Decouple major slice from minor GCs (#11750)
My understanding was that the domains did major GC work whilst blocked waiting for other domains during minor GC. This was mentioned to me as one reason why the parallel minor collections don't perform worse than the previous concurrent minor collections. If that is the case then I would expect it to be worth keeping that behaviour. The idea is that a domain would do its normal major slice at the half-way point as this PR proposes, and do extra major GC work whenever it was blocked waiting on things. The "credit" mechanism (present in the old GC; not sure if it's implemented yet in the OCaml 5 GC) then automatically shortens the next normal slice to account for the extra work done.
Great. Out of interest, is there a credit mechanism like in the old GC, so that the opportunistic work leads to doing less work at the next slice (or slices; IIRC the old GC could spread the credit across multiple slices as well)?
There is `ocaml/runtime/caml/domain_state.tbl`, lines 74 to 78 (at a4a05dd). Currently, the credit is not spread across multiple slices. This is to be done.
I reproduced the test failure and it's a little weird (getting an
About the credit mechanism, I'm working on a solution to #11589 that will subsume it. |
Thanks! That's good to know. |
sadiqj left a comment:
To summarise and make sure I've understood this PR correctly:
- We now set a `young_trigger` which is halfway through the minor heap (logic borrowed from 4.14).
- The first domain to hit the `young_trigger`:
  - sets the major slice epoch to the current number of minor collections;
  - interrupts all other domains.
- Each domain does a major slice because its domain-local major slice epoch is now less than the global one.
I think the logic for doing so works and the implementation matches the above.
It would be nice to know why the GC pacing changes significantly for a few of the sandmark benchmarks though.
Thanks for the review. I'll do an analysis of
```c
void caml_major_collection_slice(intnat howmuch)
{
  caml_domain_state *d = Caml_state;
  uintnat major_slice_epoch = atomic_load (&caml_major_slice_epoch);
```
Is there a reason for the SC ordering on this variable here and elsewhere and not a weaker one? (I am asking from the point of view that using the weakest ordering makes the code easier to understand since relaxed and acq/rel ordering are often self-documenting; when I see the use of SC ordering I need some lengthy comment explaining what is being synchronised and how.)
For code that is not performance sensitive, my approach has been to pick the ordering that doesn't require complicated reasoning, hence the choice of SC atomics here. When I see acquire/release atomics, my takeaway would be that "someone has tried to optimise this, so this must be performance sensitive". What approach do we want to take in the runtime code, @xavierleroy?
FWIW release/acquire ordering should be sufficient here.
I'm in agreement with @kayceesrk here: when in doubt (and it isn't somewhere very hot), SC is fine. We should err towards being too strong rather than too weak.
I think the issue is that it is not obvious to me what ordering you need to reason that the code is correct. (For instance, is it innocuous to reload `caml_minor_collections_count` inside `advance_global_major_slice_epoch`, and why?)
I do not think one should "err" one way or another, I think it is better if the reason why it is correct is laid out for the reader without them having to guess. I simply point out that in the event where you only need acq-rel or relaxed ordering, it is often the case that using the most precise ordering makes the code (more) self-documenting. (If looking for a general principle, reasoning about correctness takes precedence over optimizations; if some code has in addition been noticed to be performance-sensitive then I think it makes sense to write it in a comment.) Note that I do not know if we are in this situation here given the interaction with the atomic_exchange, an alternative could be to add comments.
Keep in mind that this is in a context where atomic variables accumulate in the multicore runtime, it has been hard to read in other places, and there has been synchronisation issues in the past for things that looked simple.
> If looking for a general principle, reasoning about correctness takes precedence over optimizations

Precisely why I chose SC here.
I see the merits of @gadmm's position, but I would still side with @kayceesrk on this point.
For some reason I never got the e-mail notification of your reply @kayceesrk, sorry for the delay.
> precisely why I chose SC here
Earlier you said that the principle applied here was an acquire-release pattern, now you say that using SC ordering helps with the reasoning. Which one of the two is correct? (From how I learnt to reason about parallel code, there is a contradiction here---but perhaps I am just missing something about what you mean which is obvious to you, that I still do not see.)
What I see is a function that reloads an atomic variable for no apparent reason, which is a yellow flag in parallel code to me. It is possible that the reloading is harmless, for simple reasons even, but I am not going to spend time to try to convince myself that it is so, because when I asked to clarify the code somehow, you declined.
In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.
> Earlier you said that the principle applied here was an acquire-release pattern,
I don't think I mentioned that I am using acquire/release reasoning here. I mention in https://github.com/ocaml/ocaml/pull/11750/files#r1046733634 that acquire/release is sufficient, but not that I am using acquire/release reasoning. I must be missing something.
> In any case this was only meant to be a small follow-up to the review I previously did on the draft of this PR, I did not want to make an example of the PR about this code style issue (which reminded me of frustrations when trying to review the multicore runtime) nor block the PR on it.
Thanks.
Now that #11806 is merged, you should be good for a rebase and hopefully a green CI run @kayceesrk
Here are some numbers from this PR and rc1.

**RC1**

```
$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547647
minor_words: 925546573
promoted_words: 69291317
major_words: 69292391
minor_collections: 4417
major_collections: 885
forced_major_collections: 0
heap_words: 271348
top_heap_words: 304116
mean_space_overhead: 138.830102
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.82
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8708
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2605
Voluntary context switches: 1
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

**This PR**

```
$ OCAMLRUNPARAM="v=0x400" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547728
minor_words: 925546654
promoted_words: 69291383
major_words: 69292457
minor_collections: 4253
major_collections: 721
forced_major_collections: 0
heap_words: 316404
top_heap_words: 332788
mean_space_overhead: 126.859242
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.70
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1464
Voluntary context switches: 1
Involuntary context switches: 6
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

**RC1 with space overhead of 140**

FWIW I ran the RC1 version with a space overhead of 140, and the

```
$ OCAMLRUNPARAM="v=0x400,o=140" /usr/bin/time -v _build/default/thread_ring_lwt_mvar.exe 20_000
allocated_words: 925547648
minor_words: 925546574
promoted_words: 69291317
major_words: 69292391
minor_collections: 4242
major_collections: 710
forced_major_collections: 0
heap_words: 308212
top_heap_words: 332788
mean_space_overhead: 129.885954
Command being timed: "_build/default/thread_ring_lwt_mvar.exe 20_000"
User time (seconds): 2.82
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.83
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8904
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2755
Voluntary context switches: 1
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

I conjecture that what we're observing is a difference due to when the budgets are computed for the major slices: at the end of the minor GC (trunk) or in the middle of it (this PR). In the worst case, the effects of this change are limited to a time proportional to half of the minor heap size. @sadiqj, not sure if you want me to dig into this deeper.
There is another failure in the CI. I'm going to have a look at this tomorrow.
Fixes a bug where young_trigger may not be updated to young_start at the end of the major slice (specifically, when the slice was interrupted).
Looks like the last CI failure is gone now.
No. Looks good. I had a look at the pacing and I think the biggest impact is probably that the major heap is so small; I suspect that is why we see a large swing on the lwt benchmark.
@damiendoligez Wondered if you were keen to review this PR?
Are there benchmarks showing latency improvements with this patch? The original PR (#297) doesn't seem to have any either. A Jane Street blog suggests it might be about having finer control when manually scheduling collections. |
The short answer is no, due to instrumentation & tooling not doing the right thing. See tarides/runtime_events_tools#7. With this PR, the
Reviewing now.
damiendoligez left a comment:
Looks good but I have a question.
```c
CAMLextern atomic_uintnat caml_minor_collections_count;

/* The epoch number for major slice. Used to trigger major slices.
   Always, [caml_major_slice_epoch <= caml_minor_collections_count] */
```
Is this invariant important? I think solving #11589 will involve decoupling caml_major_slice_epoch from caml_minor_collections_count and I'm wondering what would be broken by that.
`caml_major_slice_epoch` follows `caml_minor_collections_count` and is updated to `caml_minor_collections_count` when the first domain uses up half of its minor heap arena. Domains compare their own (domain-local) major slice epochs (`Caml_state->major_slice_epoch`) with this global `caml_major_slice_epoch` to decide whether to trigger a major slice.
I don't know the details of the solution to #11589. But as long as we can identify when a domain has used half of its minor heap arena, we can trigger major slices. For example, one may potentially replace `caml_major_slice_epoch` with a boolean flag that tracks whether some domain has used up half of its minor heap arena. A boolean flag would be needed per domain as well, to track whether that domain has performed the major slice for the current cycle. Such a boolean would be reset at the stop-the-world phase of the minor collection.
With a boolean flag, some other logic would be needed to ensure that those domains which didn't get a chance to run a major slice during a minor cycle trigger a slice as soon as the minor cycle is done. But I suspect that the solution to #11589 would change this part of the scheduling logic anyway.
In summary, we can work around any change that the solution to #11589 would bring in. The absolute counts aren't critical.
I pushed a Changes entry. The PR has 2 approvals and I believe it is ready to merge.
In OCaml 5.1.0, the number of minor collections is an atomic global state variable (because domains need to synchronize on this count) (see ocaml/ocaml#11750). Signed-off-by: Florian Angeletti <florian.angeletti@inria.fr>

OCaml 4 triggers a major slice when the minor heap is half full. This was shown to improve the responsiveness of programs. Multicore OCaml did not implement this functionality. This PR restores this functionality and extends it to parallel programs. In a parallel program, when one of the domains uses up half of its minor heap arena, it signals other domains to perform a slice of the major GC work.
Improvement
Sequential
A slice of the GC trace (obtained through `olly`) of the sequential binary trees benchmark compiled with trunk is shown below. Observe that the major slice immediately follows the minor collection.
The trace with this PR is shown below:
Observe that the major slices are now separated from the minor collection. They're triggered when the minor heap is half-full.
Parallel
Here is the slice of the trace from the parallel version of the binary trees benchmark. The trace with trunk (shown below) again shows that the major slice immediately follows the minor collection in the same stop-the-world section. The major slices of the domains other than the first one are short and hence hard to see on the trace; I've highlighted them with a red box.
With this PR, the major slices are spaced out between the stop-the-world minor collections:
Performance impact
As a sanity check, I measured the performance impact of this change on sequential programs in Sandmark. Here are the normalised results compared to the `trunk` baseline.

Time:

Instructions retired:

Some swing in the wall-clock time is expected. Instructions retired helps place the wall-clock time swings in context. From the normalised instructions-retired graph, one can see that most programs execute a similar number of instructions. The odd ones out are the `lwt` programs towards the right. Two of these can be explained by the increase in the number of major GCs.

Major GCs:

You can see that the two `lwt` programs end up doing more major GCs than trunk. This may explain the additional instructions executed.

Since we are changing when the major slices are scheduled, some change in the GC speed is expected. I plan to spend a little bit of time understanding why this is more significant in the `lwt` programs than in the others. That said, I don't expect this to change the PR much. I consider the PR ready for review.
lwtprograms than the other ones. That said, I don't expect this to change the PR much. I consider the PR to be ready for review.Sandmark nightly is currently down due to a broken dependency. A fix is being actively worked on. I am expecting parallelism results to be available early next week.