Conversation
gasche
left a comment
(I was randomly passing by and have a question.)
runtime/major_gc.c (Outdated)

```c
  }
  /* reset work counters */
  caml_plat_lock (&accounting_lock);
  caml_enumerate_participating_domains (&reset_domain_account);
```
I'm confused, aren't the arguments int participating_count, caml_domain_state **participating precisely supposed to let us do this (iterate over all participating domains)? If it is enough to access the stw_domains globals, why do we have them in the STW API?
What is the status of this PR? It looks like it does fix, or at least greatly alleviates, memory-consumption issues. Do we want to eventually get it merged? What do we need to do?
I'm still working on this. I'm not satisfied with the increase in memory consumption in the sequential case.
@damiendoligez I wondered if you had an idealised model for major slice scheduling in the multi-domain case. One thing we can do is to use olly to generate the traces and analyse the runs post facto to see how close this PR gets to this ideal. Perfetto has a SQL-based trace analysis framework where we can explore the trace by asking queries. IINM, runtime events currently does not report counter events for major heap sizes and estimated live data, which would be a prerequisite for feeding into this idealised model. CC @sadiqj.
I don't have a detailed model, I only know that, going back to https://gist.github.com/stedolan/52c13d6b1a30276db31ca98683c8db16, with the default overhead parameter we need to do about 5.2 units of work (the average of m and s) for each word allocated. When the GC does much less than that, the heap will get unacceptably large. The tricky part is distributing this work among the domains that do have some GC work available.
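As a purely illustrative sketch of the arithmetic above: with a mark cost of m units and a sweep cost of s units per word allocated in the major heap, the GC must do their average per word. The values used in the test below are hypothetical, chosen only to reproduce the ~5.2 figure quoted from the gist; they are not taken from the runtime's code.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical sketch: units of GC work owed per word allocated in
   the major heap, given a mark cost m and sweep cost s per word.
   The constants used by callers are illustrative, not the runtime's. */
static double work_per_word(double m, double s)
{
  return (m + s) / 2.0;
}
```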
(force-pushed from ced42cf to 7440f24)
I have much simplified the accounting logic: there is now a global counter of how much work needs to be done, and each major slice tries to do that much work, then decreases the global counter by how much work was actually done. We still need to trigger major slices on other domains when the current domain runs out of things to do. This means we sometimes have all domains doing a major slice at the same time. There is a work-sharing mechanism to handle this case without doing N times the total amount of work. The results are now better than trunk, even without any idle thread, but they still get worse when there are idle threads:

This PR:

Trunk:

The work-to-alloc column gives the ratio of GC work done to the number of words allocated in the major heap. When the GC's control loop works perfectly, this is 4 (assuming default GC settings). The backlog column gives the amount of work that is left in the counter at the end of the program, over the total amount of work. I tried to increase the running time of the test with 0 or 1 idle thread:

no idle thread:

one idle thread:
The work-to-alloc ratio and residual backlog seem to be stable when the benchmark runs longer. I still need to do a smoke check with sandmark and check whether this will fix #11548.
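The global-counter scheme described above can be sketched roughly as follows. This is a minimal single-counter model with invented names (the real logic lives in runtime/major_gc.c and is per-domain and more subtle): allocations add to a global counter of outstanding work, each slice attempts that much work, and the work actually done is subtracted, so any backlog carries over.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch only: a single global counter of outstanding major GC work,
   shared by all domains.  Names are hypothetical. */
static _Atomic long outstanding_work = 0;

/* Allocating in the major heap adds work to the counter. */
static void note_allocation(long words, long work_per_word)
{
  atomic_fetch_add(&outstanding_work, words * work_per_word);
}

/* How much work the next slice should attempt (never negative). */
static long slice_budget(void)
{
  long w = atomic_load(&outstanding_work);
  return w > 0 ? w : 0;
}

/* After a slice, subtract the work that was actually performed. */
static void slice_finished(long work_done)
{
  atomic_fetch_sub(&outstanding_work, work_done);
}
```

The leftover value of the counter at program exit corresponds to the "backlog" column in the tables above.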
If I read your numbers right, in the sequential case (spawn=0) the benchmark runs many more GC slices in your PR than in trunk: 434 with your PR, 126 with trunk. You say that this is better behavior because the work-to-alloc hits the intended target of 4.0. But is that good? I suppose that there is no issue with throughput because each slice does less work? (If I understand correctly, the impact on throughput is approximated by the work-to-alloc ratio, so 4 instead of 3.3 means slightly more time spent in the major GC but not much more, and we get to halve the memory usage in return.)
It's actually a number of major GC cycles, not slices. Note that this program is quite atypical: it allocates more in the major heap than in the minor heap (which explains why the number of minor GCs is equal to the number of major cycles: each major cycle forces a minor GC). To answer your question, I think it's good because it lets the GC reach the target work-to-alloc ratio, which trunk doesn't.
I can confirm that this PR in its current state also fixes #12042.
@stedolan @kayceesrk : I think this is ready for review and should be integrated in 5.1, although it's likely that some further improvements are possible. They will be left for a later PR.
```c
domain_state->dependent_allocated = 0;
domain_state->major_work_done_between_slices = 0;
```
The other newly introduced domain-local variables also need to be initialised.
The changes are surprisingly simple and look good to me. I should say that I haven't fully checked whether the new slice computation does the right thing. This is something that @stedolan may want to do. One component that I've been wondering about is
We have to do something similar for triggering major slices when the first domain in a minor cycle uses half of its minor heap: #11750. But the mechanisms are different. #11750 interrupts all of the domains by obtaining the
I wondered whether both of these can be expressed using a similar mechanism. Have you thought about this @damiendoligez?
@kayceesrk we discussed this PR at the triaging meeting today. We would like to merge some version of it in 5.1 if possible. If you think that you wouldn't have the time to go to the finish line in the next few weeks/months, let us know and we will look for another potential reviewer.
kayceesrk
left a comment
I've gone through the computations for updating GC work, and they look correct to me.
I would be curious to see Sandmark results on this. Has this already been done @damiendoligez?
The sandmark results are available now.

Sequential

Sequential results are here. There are a few "differences" from the trunk version.

In terms of running time, there are a few regressions. It may be worthwhile investigating

There's a drop in maxRSS for the benchmarks with the biggest slowdown. See that the two benchmarks

This PR also makes the program do a lot more major and minor GCs.

Normalised minor collection count

Normalised major collection count

Do we expect this PR to increase the major and minor GC counts? (Not that this number alone matters, as the running time + maxRSS is perhaps a better representation of the user-observable behaviour.)

Parallel

The parallel speedup results are here. On most benchmarks, this PR performs exactly the same. On
Thanks for running the benchmarks.
This paints a picture of the bugfix working correctly: when the GC is too lazy (as in trunk) the program runs fast but uses too much memory. With the fix, the GC works harder to maintain the user-chosen time/space trade-off. A slowdown of 50% means the GC is now using (more than) 33% of the running time, which is a bit high but not off the scale. Note that pi_digits5 and zarith_pi are almost exactly the same program. Maybe one of them should be eliminated from sandmark.
Major GC counts: definitely yes. Minor GC counts: since the major GC forces a minor at the start of its cycle, yes. It would be interesting to compare the absolute numbers here.
If I read the graphics correctly,
You are right. I was reading the result that I wanted to see. Good catch. Other observations make sense to me.
```c
   The explanation above applies if [sync] = 1. When [sync] = 0, no
   synchronization happens, and we simply run the handler asynchronously on
   all domains. We still hold the stw_leader field until we know that
   every domain has run the handler, so another STW section cannot
   interfere with this one.
*/
int caml_try_run_on_all_domains_with_spin_work(
  int sync,
```
I'm not entirely sure we want to change this interface for this one use case. This function looks deceptively simple to use but actually ended up being a source of many bugs.
I think there is one in the use here: https://github.com/ocaml/ocaml/pull/11903/files#diff-67115925103982a8ebeb085cfab5ef31a182c9a442bc51e053934364d3750dafR1637 . caml_try_run_on_all_domains_with_spin_work can just fail to actually run the function if it can't be the stw_leader or take the all_domains_lock. This means that we might have a requested_global_major_slice whose actions actually get ignored.
As an alternative we could support the use-case here with something simpler:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains
       and set requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;  /* stop once the flags are set */
  }
}
```
The only race is when you're about to enter a major slice anyway, so it doesn't make a difference.
I was aware of this when I wrote the code. Maybe I should leave requested_global_major_slice alone in this case. The idea was that it's not a correctness bug: if it happens infrequently enough, we're still good.
For this:

```c
while (1) {
  handle_interrupts();
  if (caml_plat_try_lock(&all_domains_lock)) {
    /* iterate through stw_domains.participating_domains
       and set requested_major_slice to 1 */
    caml_plat_unlock(&all_domains_lock);
    break;  /* stop once the flags are set */
  }
}
```
This is not enough: we also have to interrupt all other domains. And signal the backup threads of the idle domains. I started writing that code, then I realized I was rewriting a large part of caml_try_run_on_all_domains_with_spin_work. At that point I decided to reuse it instead of rolling my own buggy version.
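The point being made above can be sketched in miniature. This toy model uses invented names and flat booleans, and is far simpler than the real runtime/domain.c machinery: beyond setting the per-domain flag, running domains must be interrupted so they poll soon, and idle domains need their backup threads woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of requesting a global major slice.  None of
   these structures or names come from the actual runtime. */
#define MAX_DOMAINS 4

struct dom {
  bool running;                /* executing mutator code right now? */
  bool requested_major_slice;
  bool interrupted;            /* stands in for setting the interrupt word */
  bool backup_signalled;       /* stands in for waking the backup thread */
};

static struct dom doms[MAX_DOMAINS];

static void request_global_major_slice(void)
{
  for (int i = 0; i < MAX_DOMAINS; i++) {
    doms[i].requested_major_slice = true;
    if (doms[i].running)
      doms[i].interrupted = true;      /* make it service the request soon */
    else
      doms[i].backup_signalled = true; /* idle domain: wake its backup thread */
  }
}
```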
sadiqj
left a comment
The changes look good other than the note I already made about synchronisation.
I don't think it should block this PR (which blocks 5.1), but it would be really good to increase the documentation in the code for the actual computations that form the GC pacing. The little explainer in major_gc.c helps but we could probably go further. It took me a while to piece together the different parts.
I'm also not sure I understand why we take the max of alloc_work/extra_work/dependent_work - shouldn't extra and dependent be considered part of the heap and be included in a combination calculation of work to be done?
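For concreteness, the computation being questioned is a max rather than a sum. This is a sketch with descriptive parameter names (not the runtime's identifiers):

```c
#include <assert.h>

/* Sketch: the slice target is the maximum of the three work
   estimates, not their combination.  Names are illustrative. */
static long slice_target(long alloc_work, long extra_work,
                         long dependent_work)
{
  long t = alloc_work;
  if (extra_work > t) t = extra_work;
  if (dependent_work > t) t = dependent_work;
  return t;
}
```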
… caml_try_run_on_all_domains may fail
(force-pushed from 94c4a4b to b4f7aff)
Rebased. I have checked by hand that the
I've also manually tested the PR on
I have also relaunched the parallel tests: ocaml-bench/sandmark-nightly-config@d5e19db. The results should be available in a day.
Sandmark results confirm that the speedup is fine. Here is the sandmark run on parallel benchmarks that compares the runs before and after the fix.
Fix idle domain gc (cherry picked from commit dba3301)
Cherry-picked on 5.1 (613f96d)
Trying to fix #11589 (and maybe #11548). Two changes here:
The benchmark code is here: https://gist.github.com/damiendoligez/86a3afa6c6a594e0274d6194d91457b7
GC work accounting
Major allocations are counted globally and the counts are evenly distributed among active domains. These counts tell the major GC slices how much work they have to do. The main difficulty is that some domains may run out of major GC work for the current cycle before others. Then they cannot do any more GC work before the global synchronization at the end of the GC cycle (i.e. the STW event that starts the next cycle). In this case we have to redistribute their counts back to the domains that still have something to do.
This accounting tries to guarantee that the global speed of major collection keeps up with the speed of allocation.
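The redistribution step described above can be sketched as follows. This is an illustrative model with invented names, not the PR's code: a domain that has run out of major GC work for the cycle gives its remaining budget back, split evenly among the domains that still have something to do.

```c
#include <assert.h>

/* Sketch only: per-domain work budgets and a flag saying whether a
   domain still has major GC work left this cycle. */
#define NDOMAINS 3

static long budget[NDOMAINS];
static int  has_work[NDOMAINS];

/* Donor domain i has no GC work left: hand its budget to the others. */
static void give_back_budget(int donor)
{
  int receivers = 0;
  for (int j = 0; j < NDOMAINS; j++)
    if (j != donor && has_work[j]) receivers++;
  if (receivers == 0) return;       /* nobody left to do the work */
  long share = budget[donor] / receivers;
  for (int j = 0; j < NDOMAINS; j++)
    if (j != donor && has_work[j]) budget[j] += share;
  budget[donor] = 0;
}
```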
extra major slices
When we allocate directly in the major heap, we sometimes have to trigger major GC slices in addition to the ones that are automatically run between minor collections. In trunk, this is done only in the allocating domain, but that's not enough: in the benchmark program, the allocating domain has already run out of GC work, so these slices do nothing and allocation gets ahead of collection. In order to do these global slices, we have to send a signal to all other domains, but we don't need synchronization, so I modified the STW code to add an asynchronous mode.
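The asynchronous mode can be modelled with a simple countdown, using invented names (the real code holds the stw_leader field in runtime/domain.c): there is no barrier, each domain runs the handler whenever it services its interrupt, and the leader slot is only released once every participating domain has run it, so no other STW section can start in between.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch only: countdown of domains that still have to run the
   asynchronous handler, plus a flag modelling the held leader slot. */
static _Atomic int remaining = 0;
static _Atomic int leader_held = 0;

static void async_section_start(int participating_domains)
{
  atomic_store(&leader_held, 1);
  atomic_store(&remaining, participating_domains);
}

/* Called by each domain once it has run the handler. */
static void async_handler_done(void)
{
  if (atomic_fetch_sub(&remaining, 1) == 1)
    atomic_store(&leader_held, 0);  /* last domain releases the leader */
}
```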
benchmark results
We can clearly see some improvement, but I still have to investigate:
- why the number of minor collections changes so much
- why the sequential test of this PR uses much more memory than trunk
- what is still holding up the major GC

Fixes #11589
Fixes #12042