
Fix idle domain gc #11903

Merged: Octachron merged 22 commits into ocaml:trunk from damiendoligez:fix-idle-domain-gc on May 25, 2023

Conversation

@damiendoligez (Member) commented Jan 17, 2023

Trying to fix #11589 (and maybe #11548). Two changes here:

  1. GC work accounting and distribution become global.
  2. Extra major slices are triggered globally instead of locally.

The benchmark code is here: https://gist.github.com/damiendoligez/86a3afa6c6a594e0274d6194d91457b7

GC work accounting

Major allocations are counted globally and the counts are evenly distributed among active domains. These counts tell the major GC slices how much work they have to do. The main difficulty is that some domains may run out of major GC work for the current cycle before others. They then cannot do any more GC work before the global synchronization at the end of the GC cycle (i.e. the STW event that starts the next cycle). In this case we have to redistribute their counts back to the domains that still have something to do.

This accounting tries to guarantee that the global speed of major collection keeps up with the speed of allocation.

extra major slices

When we allocate directly in the major heap, we sometimes have to trigger major GC slices in addition to the ones that are automatically run between minor collections. In trunk, this is done only in the allocating domain, but that's not enough: in the benchmark program, the allocating domain has already run out of GC work, so these slices do nothing and allocation gets ahead of collection. In order to do these global slices, we have to send a signal to all other domains, but we don't need synchronization, so I modified the STW code to add an asynchronous mode.

benchmark results

| test | max heap size (MB) | minor collections | major collections |
|---|---|---|---|
| trunk, sequential | 30.5 | 125 | 116 |
| trunk, with idle domain | 694.0 | 121 | 6 |
| this PR, sequential | 52.6 | 112 | 112 |
| this PR, with idle domain | 99.9 | 194 | 47 |

We can clearly see some improvement, but I still have to investigate:

  • why the number of minor collections changes so much
  • why the sequential test of this PR uses much more memory than trunk
  • what is still holding up the major GC

Fixes #11589
Fixes #12042

@gasche (Member) left a comment:
(I was randomly passing by and have a question.)

```c
}
/* reset work counters */
caml_plat_lock (&accounting_lock);
caml_enumerate_participating_domains (&reset_domain_account);
```
Inline review comment (Member):
I'm confused, aren't the arguments int participating_count, caml_domain_state **participating precisely supposed to let us do this (iterate on all participating domains)? If it is enough to access the stw_domains globals, why do we have them in the STW API?

@gasche (Member) commented Feb 28, 2023

What is the status of this PR? It looks like it does fix or at least greatly alleviate memory-consumption issues. Do we want to eventually get it merged? What do we need to do?

@damiendoligez (Member, Author):
I'm still working on this. I'm not satisfied with the increase of memory consumption in the sequential case.

@kayceesrk (Contributor):
@damiendoligez I wondered if you had an idealised model for major slice scheduling in the multi-domain case. One thing we can do is to use olly to generate the traces and analyse runs post facto to see how close this PR gets to this ideal. Perfetto has a SQL-based trace analysis framework where we can explore the trace by asking queries.

IINM, runtime events currently does not report counter events for major heap sizes and estimated live data, which would be a pre-requisite for feeding into this idealised model. CC @sadiqj.

@damiendoligez (Member, Author):

I don't have a detailed model, I only know that, going back to https://gist.github.com/stedolan/52c13d6b1a30276db31ca98683c8db16, with the default overhead parameter we need to do about 5.2 units of work (the average of m and s) for each word allocated. When the GC does much less than that, the heap will get unacceptably large. The tricky part is distributing this work among the domains that do have some GC work available.

@damiendoligez (Member, Author):

I have much simplified the accounting logic: there is now a global counter of how much needs to be done, and each major slice tries to do that much work and decreases the global counter by how much work was actually done.

We still need to trigger major slices on other domains when the current domain runs out of things to do. This means we sometimes have all domains doing a major slice at the same time. There is a work-sharing mechanism to handle this case without doing N times the total amount of work.

The results are now better than trunk, even without any idle thread, but they still get worse when there are idle threads:

This PR:

| spawn | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 0 | 15.62 | 434 | 434 | 4.00 | 0.00 |
| 1 | 26.53 | 231 | 231 | 3.08 | 0.23 |
| 2 | 30.99 | 180 | 180 | 2.88 | 0.28 |
| 3 | 31.12 | 165 | 165 | 2.82 | 0.29 |
| 5 | 31.15 | 159 | 159 | 2.80 | 0.30 |
| 10 | 31.71 | 160 | 159 | 2.86 | 0.28 |

Trunk:

| spawn | heap max (MB) | minor GC | major GC | work-to-alloc |
|---|---|---|---|---|
| 0 | 30.47 | 130 | 126 | 3.31 |

The work-to-alloc column gives the ratio of GC work done to the number of words allocated in the major heap. When the GC's control loop works perfectly, this is 4 (assuming default GC settings). The backlog column gives the amount of work left in the counter at the end of the program, divided by the total amount of work.

I tried to increase the running time of the test with 0 or 1 idle thread:

no idle thread:

| steps | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 500 | 15.62 | 434 | 434 | 4.00 | 0.00 |
| 1000 | 30.27 | 918 | 918 | 4.00 | 0.00 |
| 1500 | 46.14 | 1409 | 1409 | 4.00 | 0.00 |
| 2000 | 61.40 | 1902 | 1902 | 4.00 | 0.00 |

one idle thread:

| steps | heap max (MB) | minor GC | major GC | work-to-alloc | backlog |
|---|---|---|---|---|---|
| 500 | 26.53 | 231 | 231 | 3.08 | 0.23 |
| 1000 | 54.92 | 473 | 473 | 3.03 | 0.24 |
| 1500 | 89.34 | 716 | 716 | 3.00 | 0.25 |
| 2000 | 103.00 | 960 | 960 | 3.01 | 0.25 |

The work-to-alloc ratio and residual backlog seem to be stable when the benchmark runs longer.

I still need to do a smoke check with sandmark and check whether this will fix #11548.

@gasche (Member) commented Apr 11, 2023

If I read your numbers right, in the sequential case (spawn=0) the benchmark runs many more GC slices in your PR than in trunk: 434 with your PR, 126 with trunk. You say that this is better behavior because the work-to-alloc hits the intended target of 4.0. But is that good? I suppose that there is no issue with throughput because each slice does less work?

(If I understand correctly, the impact on throughput is approximated by the work-to-alloc ratio, so 4 instead of 3.3 means slightly more time spent in the major GC but not much more, and we get to halve the memory usage in return.)

@damiendoligez damiendoligez marked this pull request as ready for review April 11, 2023 13:00
@damiendoligez (Member, Author):

> If I read your numbers right, in the sequential case (spawn=0) the benchmark runs many more GC slices in your PR than in trunk: 434 with your PR, 126 with trunk. You say that this is better behavior because the work-to-alloc hits the intended target of 4.0. But is that good? I suppose that there is no issue with throughput because each slice does less work?

It's actually a number of major GC cycles, not slices. Note that this program is quite atypical: it allocates more in the major heap than in the minor heap (which explains why the number of minor GCs is equal to the number of major cycles: each major cycle forces a minor GC). To answer your question, I think it's good because it lets the GC reach the target work-to-alloc ratio, which trunk doesn't.

@kit-ty-kate (Member):

I can confirm that this PR in its current state also fixes #12042

@damiendoligez (Member, Author):

@stedolan @kayceesrk : I think this is ready for review and should be integrated in 5.1, although it's likely that some further improvements are possible. They will be left for a later PR.

```c
domain_state->dependent_allocated = 0;
domain_state->major_work_done_between_slices = 0;
```
Inline review comment (Contributor):
The other newly introduced domain-local variables also need to be initialised.

@kayceesrk (Contributor) commented Apr 12, 2023

The changes are surprisingly simple and look good to me. I should say that I haven't fully checked whether the new slice computation does the right thing. This is something that @stedolan may want to do.

One component that I've been wondering about is

We still need to trigger major slices on other domains when the current domain runs out of things to do.

We have to do something similar for triggering major slices when the first domain in a minor cycle uses half of its minor heap: #11750. But the mechanisms are different: #11750 interrupts all of the domains by obtaining the all_domains_lock and setting the interrupt word to -1, while this PR uses an asynchronous version of the STW mechanism, caml_try_run_on_all_domains_async (see advance_global_major_slice_epoch), in order to trigger a major slice globally.

I wondered whether both of these can be expressed using a similar mechanism. Have you thought about this @damiendoligez?

@gasche (Member) commented Apr 21, 2023

@kayceesrk we discussed this PR at the triaging meeting today. We would like to merge some version of it in 5.1 if possible. If you think that you wouldn't have the time to go to the finish line in the next few weeks/months, let us know and we will look for another potential reviewer.

@kayceesrk (Contributor) left a comment:

I've gone through the computations for updating GC work, and they look correct to me.

I would be curious to see Sandmark results on this. Has this already been done @damiendoligez?

@kayceesrk (Contributor) commented Apr 24, 2023

The sandmark results are available now. Only the results from the turing machine (28 cores) are available. The 128-core navajo seems to be down.

Sequential

Sequential results are here.

There are a few "differences" from the trunk version. In terms of running time, there are a few regressions:

[chart: normalised running time, this PR vs. trunk]

It may be worthwhile investigating the pi_digits5 benchmark, which shows a 50% slowdown.

There's a drop in maxRSS for the benchmarks with the biggest slowdowns:

[chart: normalised maxRSS, this PR vs. trunk]

Note that the two benchmarks pi_digits5 and zarith_pi, which showed the biggest slowdowns, also seem to use much less memory.

This PR also makes the programs do a lot more major and minor GCs.

Normalised minor collection count

[chart: normalised minor collection count]

Normalised major collection count

[chart: normalised major collection count]

Do we expect this PR to increase the major and minor GC counts? (Not that this number alone matters, as the running time + maxRSS is perhaps a better representation of the user-observable behaviour).

Parallel

The parallel speedup results are here.

On most benchmarks, this PR performs exactly the same. On test_decompress, there is a significant 2x speedup over trunk. I suspect that this benchmark exhibits the idle-domain behaviour that this PR fixes. This is excellent.

@damiendoligez (Member, Author) commented Apr 26, 2023

@kayceesrk

> The sandmark results are available now. Only the results from the turing machine (28 cores) are available. The 128-core navajo seems to be down.

Thanks for running the benchmarks.

> It may be worthwhile investigating the pi_digits5 benchmark, which shows a 50% slowdown. There's a drop in maxRSS for the benchmarks with the biggest slowdowns. Note that the two benchmarks pi_digits5 and zarith_pi, which showed the biggest slowdowns, also seem to use much less memory.

This paints a picture of the bugfix working correctly: when the GC is too lazy (as in trunk) the program runs fast but uses too much memory. With the fix, the GC works harder to maintain the user-chosen time/space trade-off. A slowdown of 50% means the GC is now using (more than) 33% of the running time, which is a bit high but not off the scale.

Note that pi_digits5 and zarith_pi are almost exactly the same program. Maybe one of them should be eliminated from sandmark.

> Do we expect this PR to increase the major and minor GC counts? (Not that this number alone matters, as the running time + maxRSS is perhaps a better representation of the user-observable behaviour).

Major GC counts: definitely yes. Minor GC counts: since the major GC forces a minor at the start of its cycle, yes. It would be interesting to compare the absolute numbers here.

Parallel

> On most benchmarks, this PR performs exactly the same. On test_decompress, there is a significant 2x speedup over trunk. I suspect that this benchmark exhibits the idle-domain behaviour that this PR fixes. This is excellent.

If I read the graphs correctly, test_decompress exhibits a slowdown rather than a speedup. I've looked at the raw numbers, and it seems to use much more memory as well. I'll need to investigate this.

@kayceesrk (Contributor):

> If I read the graphs correctly, test_decompress exhibits a slowdown rather than a speedup. I've looked at the raw numbers, and it seems to use much more memory as well. I'll need to investigate this.

You are right. I was reading the result that I wanted to see. Good catch.

Other observations make sense to me.

Comment on lines +1368 to +1377

```c
   The explanation above applies if [sync] = 1. When [sync] = 0, no
   synchronization happens, and we simply run the handler asynchronously on
   all domains. We still hold the stw_leader field until we know that
   every domain has run the handler, so another STW section cannot
   interfere with this one.
*/
int caml_try_run_on_all_domains_with_spin_work(
    int sync,
```
Inline review comment (Contributor):
I'm not entirely sure we want to change this interface for this one use case. This function looks deceptively simple to use but actually ended up being a source of many bugs.

I think there is one in the use here: https://github.com/ocaml/ocaml/pull/11903/files#diff-67115925103982a8ebeb085cfab5ef31a182c9a442bc51e053934364d3750dafR1637 . caml_try_run_on_all_domains_with_spin_work can just fail to actually run the function if it can't be the stw_leader or take the all_domains_lock. This means that we might have a requested_global_major_slice whose actions actually get ignored.

As an alternative we could support the use-case here with something simpler:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains and set
       requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;
  }
}
```

The only race is when you're about to enter a major slice anyway, so it doesn't make a difference.

@damiendoligez (Member, Author) replied:
I was aware of this when I wrote the code. Maybe I should leave requested_global_major_slice alone in this case. The idea was that it's not a correctness bug: if it happens infrequently enough, we're still good.

For this:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains and set
       requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;
  }
}
```

This is not enough: we also have to interrupt all other domains. And signal the backup threads of the idle domains. I started writing that code, then I realized I was rewriting a large part of caml_try_run_on_all_domains_with_spin_work. At that point I decided to reuse it instead of rolling my own buggy version.

@sadiqj (Contributor) left a comment:
The changes look good other than the note I already made about synchronisation.

I don't think it should block this PR (which blocks 5.1), but it would be really good to increase the documentation in the code for the actual computations that form the GC pacing. The little explainer in major_gc.c helps, but we could probably go further. It took me a while to piece together the different parts.

I'm also not sure I understand why we take the max of alloc_work/extra_work/dependent_work - shouldn't extra and dependent be considered part of the heap and be included in a combination calculation of work to be done?

@damiendoligez (Member, Author):

Rebased. I have checked by hand that the test_decompress speedup is OK. @kayceesrk do you want to relaunch the parallel tests? IMO it's not worth it unless it is a negligible amount of work.

@kayceesrk (Contributor):

I've also manually tested the PR on test_decompress and I confirm the speedup is fine. I'm ok to merge this PR.

I have also relaunched the parallel tests: ocaml-bench/sandmark-nightly-config@d5e19db. The results should be available in a day.

@kayceesrk (Contributor):

Sandmark results confirm that the speedup is fine. Here is the sandmark run on parallel benchmarks that compares the runs before and after the fix.

@Octachron Octachron merged commit dba3301 into ocaml:trunk May 25, 2023
Octachron added a commit that referenced this pull request May 25, 2023
Fix idle domain gc

(cherry picked from commit dba3301)
@Octachron (Member):

Cherry-picked on 5.1 (613f96d)


Labels

merge-me, Performance (PR or issues affecting runtime performance of the compiled programs), runtime-system


Development

Successfully merging this pull request may close these issues.

  • Out-of-memory crash when using a second domain
  • Idle domain slows down major GC cycles

8 participants