
Fix idle domain gc #11903

Merged: Octachron merged 22 commits into ocaml:trunk from damiendoligez:fix-idle-domain-gc on May 25, 2023

Conversation

@damiendoligez (Member) commented Jan 17, 2023

Trying to fix #11589 (and maybe #11548). Two changes here:

  1. GC work accounting and distribution become global.
  2. Extra major slices are triggered globally instead of locally.

The benchmark code is here: https://gist.github.com/damiendoligez/86a3afa6c6a594e0274d6194d91457b7

GC work accounting

Major allocations are counted globally and the counts are evenly distributed among active domains. These counts tell the major GC slices how much work they have to do. The main difficulty is that some domains may run out of major GC work for the current cycle before others. They then cannot do any more GC work before the global synchronization at the end of the GC cycle (i.e. the STW event that starts the next cycle). In this case we have to redistribute their counts back to the domains that still have something to do.

This accounting tries to guarantee that the global speed of major collection keeps up with the speed of allocation.

extra major slices

When we allocate directly in the major heap, we sometimes have to trigger major GC slices in addition to the ones that are automatically run between minor collections. In trunk, this is done only in the allocating domain, but that's not enough: in the benchmark program, the allocating domain has already run out of GC work, so these slices do nothing and allocation gets ahead of collection. In order to do these global slices, we have to send a signal to all other domains, but we don't need synchronization, so I modified the STW code to add an asynchronous mode.

benchmark results

| test | max heap size (MB) | minor collections | major collections |
|---|---|---|---|
| trunk, sequential | 30.5 | 125 | 116 |
| trunk, with idle domain | 694.0 | 121 | 6 |
| this PR, sequential | 52.6 | 112 | 112 |
| this PR, with idle domain | 99.9 | 194 | 47 |

We can clearly see some improvement, but I still have to investigate:

  • why the number of minor collections changes so much
  • why the sequential test of this PR uses much more memory than trunk
  • what is still holding up the major GC

Fixes #11589
Fixes #12042

@gasche (Member) left a comment:
(I was randomly passing by and have a question.)

```c
}
/* reset work counters */
caml_plat_lock (&accounting_lock);
caml_enumerate_participating_domains (&reset_domain_account);
```
Inline review comment (Member):
I'm confused, aren't the arguments int participating_count, caml_domain_state **participating precisely supposed to let us do this (iterate on all participating domains)? If it is enough to access the stw_domains globals, why do we have them in the STW API?

@gasche (Member) commented Feb 28, 2023

What is the status of this PR? It looks like it does fix or at least greatly alleviate memory-consumption issues. Do we want to eventually get it merged? What do we need to do?

@damiendoligez (Member, Author):
I'm still working on this. I'm not satisfied with the increase of memory consumption in the sequential case.

@kayceesrk (Contributor):
@damiendoligez I wondered if you had an idealised model for major slice scheduling in the multi-domain case. One thing we can do is to use olly to generate the traces and analyse runs post facto to see how close this PR gets to this ideal. Perfetto has a SQL-based trace analysis framework where we can explore the trace by asking queries.

IINM, runtime events currently does not report counter events for major heap sizes and estimated live data, which would be a pre-requisite for feeding into this idealised model. CC @sadiqj.

@damiendoligez (Member, Author):

I don't have a detailed model, I only know that, going back to https://gist.github.com/stedolan/52c13d6b1a30276db31ca98683c8db16, with the default overhead parameter we need to do about 5.2 units of work (the average of m and s) for each word allocated. When the GC does much less than that, the heap will get unacceptably large. The tricky part is distributing this work among the domains that do have some GC work available.

@damiendoligez (Member, Author):

I have much simplified the accounting logic: there is now a global counter of how much needs to be done, and each major slice tries to do that much work and decreases the global counter by how much work was actually done.

We still need to trigger major slices on other domains when the current domain runs out of things to do. This means we sometimes have all domains doing a major slice at the same time. There is a work-sharing mechanism to handle this case without doing N times the total amount of work.

The results are now better than trunk, even without any idle thread, but they still get worse when there are idle threads:

This PR:

| spawn | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 0 | 15.62 | 434 | 434 | 4.00 | 0.00 |
| 1 | 26.53 | 231 | 231 | 3.08 | 0.23 |
| 2 | 30.99 | 180 | 180 | 2.88 | 0.28 |
| 3 | 31.12 | 165 | 165 | 2.82 | 0.29 |
| 5 | 31.15 | 159 | 159 | 2.80 | 0.30 |
| 10 | 31.71 | 160 | 159 | 2.86 | 0.28 |

Trunk:

| spawn | heap max (MB) | minor GC | major GC | work-to-alloc |
|---|---|---|---|---|
| 0 | 30.47 | 130 | 126 | 3.31 |

The work-to-alloc column gives the ratio of GC work done to the number of words allocated in the major heap. When the GC's control loop works perfectly, this is 4 (assuming default GC settings). The backlog column gives the amount of work left in the counter at the end of the program, divided by the total amount of work.

I tried to increase the running time of the test with 0 or 1 idle thread:

no idle thread:

| steps | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 500 | 15.62 | 434 | 434 | 4.00 | 0.00 |
| 1000 | 30.27 | 918 | 918 | 4.00 | 0.00 |
| 1500 | 46.14 | 1409 | 1409 | 4.00 | 0.00 |
| 2000 | 61.40 | 1902 | 1902 | 4.00 | 0.00 |

one idle thread:

| steps | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 500 | 26.53 | 231 | 231 | 3.08 | 0.23 |
| 1000 | 54.92 | 473 | 473 | 3.03 | 0.24 |
| 1500 | 89.34 | 716 | 716 | 3.00 | 0.25 |
| 2000 | 103.00 | 960 | 960 | 3.01 | 0.25 |

The work-to-alloc ratio and residual backlog seem to be stable when the benchmark runs longer.

I still need to do a smoke check with sandmark and check whether this will fix #11548.

@gasche (Member) commented Apr 11, 2023

If I read your numbers right, in the sequential case (spawn=0) the benchmark runs many more GC slices in your PR than in trunk: 434 with your PR, 126 with trunk. You say that this is better behavior because the work-to-alloc hits the intended target of 4.0. But is that good? I suppose that there is no issue with throughput because each slice does less work?

(If I understand correctly, the impact on throughput is approximated by the work-to-alloc ratio, so 4 instead of 3.3 means slightly more time spent in the major GC but not much more, and we get to halve the memory usage in return.)

@damiendoligez damiendoligez marked this pull request as ready for review April 11, 2023 13:00
@damiendoligez (Member, Author):

> If I read your numbers right, in the sequential case (spawn=0) the benchmark runs many more GC slices in your PR than in trunk: 434 with your PR, 126 with trunk. You say that this is better behavior because the work-to-alloc hits the intended target of 4.0. But is that good? I suppose that there is no issue with throughput because each slice does less work?

It's actually a number of major GC cycles, not slices. Note that this program is quite atypical: it allocates more in the major heap than in the minor heap (which explains why the number of minor GCs is equal to the number of major cycles: each major cycle forces a minor GC). To answer your question, I think it's good because it lets the GC reach the target work-to-alloc ratio, which trunk doesn't.

@kit-ty-kate (Member):

I can confirm that this PR in its current state also fixes #12042

@damiendoligez (Member, Author):

@stedolan @kayceesrk : I think this is ready for review and should be integrated in 5.1, although it's likely that some further improvements are possible. They will be left for a later PR.

```c
domain_state->dependent_allocated = 0;
domain_state->major_work_done_between_slices = 0;
```
Inline review comment (Contributor):
The other newly introduced domain-local variables also need to be initialised.

@kayceesrk (Contributor) commented Apr 12, 2023

The changes are surprisingly simple and look good to me. I should say that I haven't fully checked whether the new slice computation does the right thing. This is something that @stedolan may want to do.

One component that I've been wondering about is

We still need to trigger major slices on other domains when the current domain runs out of things to do.

We have to do something similar for triggering major slices when the first domain in a minor cycle uses half of its minor heap: #11750. But the mechanisms are different: #11750 interrupts all of the domains by obtaining the all_domains_lock and setting the interrupt word to -1, while this PR uses an asynchronous version of the STW mechanism, caml_try_run_on_all_domains_async (see advance_global_major_slice_epoch), in order to trigger a major slice globally.

I wondered whether both of these can be expressed using a similar mechanism. Have you thought about this @damiendoligez?

@gasche (Member) commented Apr 21, 2023

@kayceesrk we discussed this PR at the triaging meeting today. We would like to merge some version of it in 5.1 if possible. If you think that you wouldn't have the time to go to the finish line in the next few weeks/months, let us know and we will look for another potential reviewer.

@kayceesrk (Contributor) left a comment:

I've gone through the computations for updating GC work, and they look correct to me.

I would be curious to see Sandmark results on this. Has this already been done @damiendoligez?

@kayceesrk (Contributor) commented Apr 24, 2023

The sandmark results are available now. Only the results from the turing machine (28 cores) are available. The 128-core navajo seems to be down.

Sequential

Sequential results are here.

There are a few "differences" from the trunk version. In terms of running time, there are a few regressions:

[chart: normalised running time, this PR vs. trunk]

It may be worthwhile investigating the pi_digits5 benchmark, which shows a 50% slowdown.

There's a drop in maxRSS for the benchmarks with the biggest slowdowns:

[chart: normalised maxRSS, this PR vs. trunk]

Note that the two benchmarks pi_digits5 and zarith_pi, which showed the biggest slowdowns, also seem to use much less memory.

This PR also makes the programs do a lot more major and minor GCs.

Normalised minor collection count

[chart: normalised minor collection count]

Normalised major collection count

[chart: normalised major collection count]

Do we expect this PR to increase the major and minor GC counts? (Not that this number alone matters, as the running time + maxRSS is perhaps a better representation of the user-observable behaviour).

Parallel

The parallel speedup results are here.

On most benchmarks, this PR performs exactly the same. On test_decompress, there is a significant 2x speedup over trunk. I suspect that this benchmark exhibits the idle-domain behaviour that this PR fixes. This is excellent.

@damiendoligez (Member, Author) commented Apr 26, 2023

@kayceesrk

> The sandmark results are available now. Only the results from the turing machine (28 cores) are available. The 128-core navajo seems to be down.

Thanks for running the benchmarks.

> It may be worthwhile investigating the pi_digits5 benchmark, which shows a 50% slowdown. There's a drop in maxRSS for the benchmarks with the biggest slowdowns. Note that the two benchmarks pi_digits5 and zarith_pi, which showed the biggest slowdowns, also seem to use much less memory.

This paints a picture of the bugfix working correctly: when the GC is too lazy (as in trunk) the program runs fast but uses too much memory. With the fix, the GC works harder to maintain the user-chosen time/space trade-off. A slowdown of 50% means the GC is now using (more than) 33% of the running time, which is a bit high but not off the scale.

Note that pi_digits5 and zarith_pi are almost exactly the same program. Maybe one of them should be eliminated from sandmark.

> Do we expect this PR to increase the major and minor GC counts? (Not that this number alone matters, as the running time + maxRSS is perhaps a better representation of the user-observable behaviour).

Major GC counts: definitely yes. Minor GC counts: since the major GC forces a minor at the start of its cycle, yes. It would be interesting to compare the absolute numbers here.

Parallel

> On most benchmarks, this PR performs exactly the same. On test_decompress, there is a significant 2x speedup over trunk. I suspect that this benchmark exhibits the idle-domain behaviour that this PR fixes. This is excellent.

If I read the graphs correctly, test_decompress exhibits a slowdown rather than a speedup. I've looked at the raw numbers, and it seems to use much more memory as well. I'll need to investigate this.

@kayceesrk (Contributor):

> If I read the graphs correctly, test_decompress exhibits a slowdown rather than a speedup. I've looked at the raw numbers, and it seems to use much more memory as well. I'll need to investigate this.

You are right. I was reading the result that I wanted to see. Good catch.

Other observations make sense to me.

Comment on lines +1368 to +1377

```c
   The explanation above applies if [sync] = 1. When [sync] = 0, no
   synchronization happens, and we simply run the handler asynchronously on
   all domains. We still hold the stw_leader field until we know that
   every domain has run the handler, so another STW section cannot
   interfere with this one.
*/
int caml_try_run_on_all_domains_with_spin_work(
    int sync,
```
Inline review comment (Contributor):
I'm not entirely sure we want to change this interface for this one use case. This function looks deceptively simple to use but actually ended up being a source of many bugs.

I think there is one in the use here: https://github.com/ocaml/ocaml/pull/11903/files#diff-67115925103982a8ebeb085cfab5ef31a182c9a442bc51e053934364d3750dafR1637 . caml_try_run_on_all_domains_with_spin_work can just fail to actually run the function if it can't be the stw_leader or take the all_domains_lock. This means that we might have a requested_global_major_slice whose actions actually get ignored.

As an alternative we could support the use-case here with something simpler:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains and set
       requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;
  }
}
```

The only race is when you're about to enter a major slice anyway, so it doesn't make a difference.

@damiendoligez (Member, Author) replied:
I was aware of this when I wrote the code. Maybe I should leave requested_global_major_slice alone in this case. The idea was that it's not a correctness bug: if it happens infrequently enough, we're still good.

For this:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains and set
       requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;
  }
}
```

This is not enough: we also have to interrupt all other domains. And signal the backup threads of the idle domains. I started writing that code, then I realized I was rewriting a large part of caml_try_run_on_all_domains_with_spin_work. At that point I decided to reuse it instead of rolling my own buggy version.

@sadiqj (Contributor) left a comment:
The changes look good other than the note I already made about synchronisation.

I don't think it should block this PR (which blocks 5.1), but it would be really good to increase the documentation in the code for the actual computations that form the GC pacing. The little explainer in major_gc.c helps, but we could probably go further. It took me a while to piece together the different parts.

I'm also not sure I understand why we take the max of alloc_work/extra_work/dependent_work - shouldn't extra and dependent be considered part of the heap and be included in a combination calculation of work to be done?

@damiendoligez (Member, Author):

Rebased. I have checked by hand that the test_decompress speedup is OK. @kayceesrk do you want to relaunch the parallel tests? IMO it's not worth it unless it is a negligible amount of work.

@kayceesrk (Contributor):

I've also manually tested the PR on test_decompress and I confirm the speedup is fine. I'm ok to merge this PR.

I have also relaunched the parallel tests: ocaml-bench/sandmark-nightly-config@d5e19db. The results should be available in a day.

@kayceesrk (Contributor):

Sandmark results confirm that the speedup is fine. Here is the sandmark run on parallel benchmarks that compares the runs before and after the fix.

@Octachron Octachron merged commit dba3301 into ocaml:trunk May 25, 2023
Octachron added a commit that referenced this pull request May 25, 2023
Fix idle domain gc

(cherry picked from commit dba3301)
@Octachron (Member):

Cherry-picked on 5.1 (613f96d)


Labels

merge-me, Performance (PR or issues affecting runtime performance of the compiled programs), runtime-system


Development

Successfully merging this pull request may close these issues.

  • Out-of-memory crash when using a second domain
  • Idle domain slows down major GC cycles

8 participants