This repository was archived by the owner on Jun 21, 2024. It is now read-only.

Domain local allocation buffers#508

Closed
sadiqj wants to merge 27 commits into ocaml-multicore:4.12+domains+effects from sadiqj:dlabs_4.12
Closed

Domain local allocation buffers#508
sadiqj wants to merge 27 commits intoocaml-multicore:4.12+domains+effectsfrom
sadiqj:dlabs_4.12

Conversation

@sadiqj sadiqj commented Mar 23, 2021

This PR extends #484

  1. We now resize the global minor heap lazily, based on the number of running domains participating in a minor collection
  2. Memory is only committed/released when the number of running domains changes, which should happen rarely at steady state
  3. Domain local allocation buffers are set to 1/8th of the per-domain share of the global minor heap and are replenished as necessary
  4. The branch has been rebased onto 4.12

There is currently one failing test locally, timing.ml. I'm still trying to figure out what's going on with that.

@sadiqj sadiqj force-pushed the dlabs_4.12 branch 7 times, most recently from ff04e58 to 3065b2e on March 30, 2021 16:44
@sadiqj sadiqj requested a review from abbysmal March 31, 2021 13:25
@abbysmal abbysmal left a comment


Thanks for this work!
I think the changes are fine; I found a thing or two that looks weird, but overall the core logic looks sane to me.
The minor heap setting bits are a bit tricky; I will address these as I take over the branch. (I also think the testsuite deserves more coverage for these.)

Naming-wise, I think things make sense now. There are a few config options we may want to change; I will take care of this later as well.

// Calculate the size of the existing mapping
int global_minor_heap_size = caml_global_minor_heap_limit - caml_global_minor_heap_start;
// Now using the number of participating domains, we calculate the new size
int new_global_minor_heap_size = participating_domains*Bsize_wsize(global_minor_heap_wsz_per_domain) + caml_global_minor_heap_start;
@abbysmal abbysmal Apr 1, 2021


I think adding caml_global_minor_heap_start wasn't meant here.
It just works because everything is page aligned (and thus is effectively adding zero to the number).

}
global_minor_heap_wsz_per_domain = wsize;

if (domain_state->young_ptr != domain_state->young_end) caml_minor_collection ();

I think we should always trigger a minor collection here; there are conditions where a minor collection won't be hit on the next GC poll.

runtime/domain.c Outdated
/* setting young_limit and young_ptr to minor_heaps_base
to trigger minor_heaps reallocation on GC poll */
domain_state->young_start = (char*)caml_global_minor_heap_start;
domain_state->young_end = (char*)caml_global_minor_heap_start;
Contributor


the minor_heaps_base naming in the comment is now out of date

CAMLextern void caml_minor_collection (void);
CAMLextern asize_t global_minor_heap_wsz_per_domain;

#ifdef CAML_INTERNALS
Collaborator


Is the removal of this CAML_INTERNALS guard intentional?
Aren't some of these functions meant to be guarded so they are not visible elsewhere?

runtime/domain.c Outdated
goto reallocate_minor_heap_failure;
/* setting young_limit and young_ptr to minor_heaps_base
to trigger minor_heaps reallocation on GC poll */
domain_state->young_start = (char*)caml_global_minor_heap_start;
Collaborator


This function create_domain can execute in parallel with STW segments but the domain is not yet part of the STW. It will be brittle to read caml_global_minor_heap_start multiple times should we ever change how that variable is assigned.
Is there any reason to read this variable at all? Can't we just set all the young_* values to a suitable null value and trigger the collection we need when the values are used?

runtime/domain.c Outdated

young_limit = atomic_load_acq((atomic_uintnat*)&domain_state->young_limit);
if( young_limit != INTERRUPT_MAGIC ) {
atomic_compare_exchange_strong((atomic_uintnat*)&domain_state->young_limit, &young_limit, caml_global_minor_heap_start);
Collaborator


I think this snippet to update the young_limit should use caml_update_young_limit

runtime/domain.c Outdated
// Check if our minor heap is full. If it is then we need to try to grab a
// new one from the global minor heap.
if( domain_minor_heap_full && !need_minor_gc ) {
uintnat global_ptr =

@ctk21 ctk21 Apr 14, 2021


Is there a reason why we need to do this check with the global_ptr? We have to deal with that case inside caml_replenish_minor_heap already.

I didn't understand why the body of this if block wasn't just:

/* try to get a new minor heap */
if(!caml_replenish_minor_heap()) {
   need_minor_gc = 1;
}

Collaborator


Good catch, this case can indeed be simplified.

runtime/domain.c Outdated
static void caml_poll_gc_work()
{
CAMLalloc_point_here;
{ // No GC in this block
Collaborator


I didn't understand this comment, as there is a call to caml_empty_minor_heaps_once inside the block!

Collaborator


I'm unsure as well; I think it is only meant for the first part of this function? (A GC should only be triggered after this preamble, through the need_minor_gc flag?) @sadiqj

Collaborator Author


I've read that comment a few times and I don't understand what it means. Suggest we get rid of it.

uintnat cached_global_minor_heap_ptr;
uintnat new_alloc_ptr;

uintnat minor_buffer_wsize = global_minor_heap_wsz_per_domain >> 3;
Collaborator


I had a concern here that a user might set the minor heap size so small that the resulting buffer couldn't hold even the smallest allocation. Do we need a guard here, or in the initialization of global_minor_heap_wsz_per_domain?

caml_update_young_limit(cached_global_minor_heap_ptr);

domain_state->young_start = (char*)cached_global_minor_heap_ptr;
domain_state->young_end = (char*)(cached_global_minor_heap_ptr + minor_buffer_bsize);
Collaborator


Is this minor_buffer_bsize really correct?
Aren't there code paths here where the requested_buffer_size is not equivalent to minor_buffer_bsize?

domain->state->id,
100.0 * (double)st.live_bytes / (double)minor_allocated_bytes,
(unsigned)(minor_allocated_bytes + 512)/1024, rewrite_successes, rewrite_failures);
100.0 * (double)st.live_bytes / (double)(domain_state->stat_minor_words),
Collaborator


I don't understand this change from minor_allocated_bytes to stat_minor_words.
The change makes the fraction's units bytes/words rather than dimensionless; wouldn't we need to alter st.live_bytes to be live words?
The second change probably needs us to alter "KB live" to "K words live".

Collaborator


This change takes into account that we may now have used more minor heap space than just one segment (the one currently assigned to the domain).
If the domain cycled through many minor heap segments with DLABs, we keep track of these in the replenish cycle: every time we finish a replenish, we add the previous minor heap segment to stat_minor_words.
This effectively lets us keep track of the previous minor heap segments used by this domain.
However, it means we do need to use stat_minor_words instead of the more local minor_allocated_bytes (which only knows about the current minor heap segment).
I think a simpler fix would be to change (domain_state->stat_minor_words) to Bsize_wsize(domain_state->stat_minor_words)?

…t to force gc poll on next allocation path if replenish failed
abbysmal added 17 commits May 4, 2021 11:52
…l take care of knowing if there's enough space or not.
…d be checked against the current caml_global_minor_heap_limit instead.

ctk21 commented May 10, 2021

I have run this PR (on 20210504 with sandmark) using a large two socket Zen2 machine and some larger than normal benchmark workloads. These are the results I got:
[Charts: execution time (sec), speedup, and minor collections for the 20210504 big-instance DLAB runs]

There are some odd things here. We seem to cut the number of minor GCs quite a bit with some of the DLAB runs; however, in terms of execution time at higher core counts the PR does not seem to improve our runtime. In some instances it runs much slower (e.g. lu_decomposition, minilight, spectralnorm2).

@abbysmal

Here's an excerpt of the bench results (run by @shubhamkumar13 on godal, an IIT machine).
The provided results are for 256k-sized minor heaps (speedup, time_sec and minor_collections).
[Charts: speedup, execution time, and minor collections for the 256k minor-heap runs]


ctk21 commented May 11, 2021

Some experiments were conducted to see if the way we were handling the memory allocation of the minor heaps was responsible for the slowdowns on the big instances. The tags in the results below are:

  • 4.12.0+domains+effects baseline
  • 4.12.0+domains+effects_dlabs this PR
  • 4.12.0+domains+effects_dlabs_malloc this PR but where we make use of glibc malloc to allocate all the minor heap and TLS areas. See this branch, results for this tree 427b02d.
  • 4.12.0+domains+effects_dlabs_commit_decommit this PR but where we commit each minor heap segment as we get it and decommit on minor GC (this is not necessarily sensible, but we wanted to know how bad that is). See this branch, results for this tree 7e1ca31.

[Charts: execution time and speedup for the 20210511 commit/decommit experiments]

One idea not tried, but which might be worth exploring, is a 'minor heap stealing' strategy that would be more like HPC work-stealing:

  • each domain has its own minor heap, as in the existing baseline code
  • when a domain runs out of minor heap, it tries to steal from another domain's minor heap
  • when all the minor heap space (or a significant fraction) is used, do the minor collection

It does beg the question of whether using non-local minor heap space is ever better on larger machines than just collecting early.


ctk21 commented May 12, 2021

I have run the benchmarks with a prefetching experiment:

  • 4.12+domains+effects baseline (b23a41).
  • 4.12+domains+effects+dlabs this PR (099a65).
  • 4.12+domains+effects+dlabs_prefetch this branch with this tree (f25633)

Here are results with 'big' instances on a detuned Zen2:
[Charts: execution time, speedup, and minor GCs for the 20210512 prefetch experiments]

The prefetching is having a beneficial effect. We even see a datapoint with binarytrees where dlabs outperformed at high core counts, offset against it underperforming in the same benchmark at lower core counts without prefetching. Worth noting that the baseline does not prefetch on its minor heap, though it is unclear whether that matters.

(NB: sandmark is slightly updated in this run vs the previous for matrix_multiplication and floyd_warshall; apologies that it is not fixed code there)


ctk21 commented May 21, 2021

We are closing this PR. However we have summarised where we got to here.
