This repository was archived by the owner on Jun 21, 2024. It is now read-only.

Domain local allocation buffers#508

Closed
sadiqj wants to merge 27 commits into ocaml-multicore:4.12+domains+effects from sadiqj:dlabs_4.12
Closed

Domain local allocation buffers#508
sadiqj wants to merge 27 commits intoocaml-multicore:4.12+domains+effectsfrom
sadiqj:dlabs_4.12

Conversation

@sadiqj sadiqj commented Mar 23, 2021

This PR extends #484

  1. We now resize the global minor heap lazily, based on the number of running domains participating in a minor collection
  2. Memory is only committed/released when the number of running domains changes, which should happen rarely at steady state
  3. Domain local allocation buffers are set to 1/8th of the per-domain share of the global minor heap and are replenished as necessary
  4. The branch has been rebased onto 4.12

There is currently one failing test locally, timing.ml. I'm still trying to figure out what's going on with that.

@sadiqj sadiqj force-pushed the dlabs_4.12 branch 7 times, most recently from ff04e58 to 3065b2e on March 30, 2021 16:44
@sadiqj sadiqj requested a review from abbysmal March 31, 2021 13:25
@abbysmal abbysmal left a comment


Thanks for this work!
I think the changes are fine; I found a thing or two that looks weird, but overall the core logic looks sane to me.
The minor heap setting bits are a bit tricky; I will address these as I take over the branch. (I also think the testsuite deserves more coverage for these.)

Naming-wise, I think things make sense now. There are a few config options we may want to change; I will take care of this later as well.

// Calculate the size of the existing mapping
int global_minor_heap_size = caml_global_minor_heap_limit - caml_global_minor_heap_start;
// Now using the number of participating domains, we calculate the new size
int new_global_minor_heap_size = participating_domains*Bsize_wsize(global_minor_heap_wsz_per_domain) + caml_global_minor_heap_start;
@abbysmal abbysmal Apr 1, 2021


I think adding caml_global_minor_heap_start wasn't meant here.
It just works because everything is page aligned (and thus is effectively adding zero to the number).

}
global_minor_heap_wsz_per_domain = wsize;

if (domain_state->young_ptr != domain_state->young_end) caml_minor_collection ();

I think we should always trigger a minor collection here; there are conditions where a minor collection won't be hit on the next GC poll.

runtime/domain.c Outdated
/* setting young_limit and young_ptr to minor_heaps_base
to trigger minor_heaps reallocation on GC poll */
domain_state->young_start = (char*)caml_global_minor_heap_start;
domain_state->young_end = (char*)caml_global_minor_heap_start;
Contributor


the minor_heaps_base naming in the comment is now out of date

CAMLextern void caml_minor_collection (void);
CAMLextern asize_t global_minor_heap_wsz_per_domain;

#ifdef CAML_INTERNALS
Collaborator


Is the removal of this CAML_INTERNALS guard intentional?
Aren't some of these functions meant to be guarded so they are not visible elsewhere?

runtime/domain.c Outdated
goto reallocate_minor_heap_failure;
/* setting young_limit and young_ptr to minor_heaps_base
to trigger minor_heaps reallocation on GC poll */
domain_state->young_start = (char*)caml_global_minor_heap_start;
Collaborator


This function create_domain can execute in parallel with STW segments but the domain is not yet part of the STW. It will be brittle to read caml_global_minor_heap_start multiple times should we ever change how that variable is assigned.
Is there any reason to read this variable at all? Can't we just set all the young_* values to a suitable null value and trigger the collection we need when the values are used?

runtime/domain.c Outdated

young_limit = atomic_load_acq((atomic_uintnat*)&domain_state->young_limit);
if( young_limit != INTERRUPT_MAGIC ) {
atomic_compare_exchange_strong((atomic_uintnat*)&domain_state->young_limit, &young_limit, caml_global_minor_heap_start);
Collaborator


I think this snippet to update the young_limit should use caml_update_young_limit

runtime/domain.c Outdated
// Check if our minor heap is full. If it is then we need to try to grab a
// new one from the global minor heap.
if( domain_minor_heap_full && !need_minor_gc ) {
uintnat global_ptr =

@ctk21 ctk21 Apr 14, 2021


Is there a reason why we need to do this check with the global_ptr? We have to deal with that case inside caml_replenish_minor_heap already.

I didn't understand why the body of this if block wasn't just:

/* try to get a new minor heap */
if(!caml_replenish_minor_heap()) {
   need_minor_gc = 1;
}

Collaborator


Good catch, this case can indeed be simplified.

runtime/domain.c Outdated
static void caml_poll_gc_work()
{
CAMLalloc_point_here;
{ // No GC in this block
Collaborator


I didn't understand this comment, as there is a call to caml_empty_minor_heaps_once inside the block!

Collaborator


I'm unsure as well; I think it is only meant for the first part of this function? (A GC should only be triggered after this preamble, through the need_minor_gc flag?) @sadiqj

Collaborator Author


I've read that comment a few times and I don't understand what it means. Suggest we get rid of it.

uintnat cached_global_minor_heap_ptr;
uintnat new_alloc_ptr;

uintnat minor_buffer_wsize = global_minor_heap_wsz_per_domain >> 3;
Collaborator


I had a concern here that a user might set the minor heap size so small that the resulting buffer couldn't hold even the smallest allocation. Do we need a guard here, or in the initialization of global_minor_heap_wsz_per_domain?

caml_update_young_limit(cached_global_minor_heap_ptr);

domain_state->young_start = (char*)cached_global_minor_heap_ptr;
domain_state->young_end = (char*)(cached_global_minor_heap_ptr + minor_buffer_bsize);
Collaborator


Is this minor_buffer_bsize really correct?
Aren't there code paths here where the requested_buffer_size is not equivalent to minor_buffer_bsize?

domain->state->id,
100.0 * (double)st.live_bytes / (double)minor_allocated_bytes,
(unsigned)(minor_allocated_bytes + 512)/1024, rewrite_successes, rewrite_failures);
100.0 * (double)st.live_bytes / (double)(domain_state->stat_minor_words),
Collaborator


I don't understand this change from minor_allocated_bytes to stat_minor_words.
The change makes the fraction's units bytes/words rather than dimensionless; wouldn't we need to alter st.live_bytes to be live words?
The second change probably needs us to alter "KB live" to "K words live".

Collaborator


This change takes into account that we may now have used more minor heap space than just one segment (the one currently assigned to the domain).
If the domain cycled through many minor heap segments with DLABs, we keep track of these in the replenish cycle: every time we finish a replenish, we add the previous minor heap segment to stat_minor_words.
This effectively lets us keep track of the previous minor heap segments used by this domain.
However, it means we do need to use stat_minor_words instead of the more local minor_allocated_bytes (which only knows about the current minor heap segment).
I think a simpler fix would be to change (domain_state->stat_minor_words) to Bsize_wsize(domain_state->stat_minor_words)?

…t to force gc poll on next allocation path if replenish failed
abbysmal added 17 commits May 4, 2021 11:52
…l take care of knowing if there's enough space or not.
…d be checked against the current caml_global_minor_heap_limit instead.

ctk21 commented May 10, 2021

I have run this PR (on 20210504 with sandmark) using a large two socket Zen2 machine and some larger than normal benchmark workloads. These are the results I got:
[Charts: execution time (sec), speedup, and minor collections for the 20210504 big-instance DLAB runs]

There are some odd things here. We seem to cut the number of minor GCs quite a bit with some of the DLAB runs; however, in terms of execution time at higher core counts the PR does not seem to improve our runtime. In some instances it runs much slower (e.g. lu_decomposition, minilight, spectralnorm2).

@abbysmal

Here's an excerpt of the bench results (run by @shubhamkumar13 on godal, an IIT machine).
The provided results are for 256k-sized minor heaps (speedup, time_sec and minor_collections).
[Charts: speedup, execution time, and minor collections for the 256k minor-heap runs]


ctk21 commented May 11, 2021

Some experiments were conducted to see if the way we were handling the memory allocation of the minor heaps was responsible for the slowdowns on the big instances. The tags in the results below are:

  • 4.12.0+domains+effects baseline
  • 4.12.0+domains+effects_dlabs this PR
  • 4.12.0+domains+effects_dlabs_malloc this PR but where we make use of glibc malloc to allocate all the minor heap and TLS areas. See this branch, results for this tree 427b02d.
  • 4.12.0+domains+effects_dlabs_commit_decommit this PR but where we commit each minor heap segment as we get it and decommit on minor GC (this is not necessarily sensible, but we wanted to know how bad that is). See this branch, results for this tree 7e1ca31.

[Charts: execution time and speedup for the 20210511 commit/decommit experiments]

One idea not tried, but which might be worth exploring, is a 'minor heap stealing' strategy that would be more like HPC work-stealing:

  • each domain has its own minor heap, as in the existing baseline code
  • when a domain runs out of minor heap, it tries to steal from another domain's minor heap
  • when all the minor heap space (or a significant fraction) is used, do the minor collection

It does beg the question of whether using non-local minor heap space is ever better on larger machines than just collecting early.


ctk21 commented May 12, 2021

I have run the benchmarks with a prefetching experiment:

  • 4.12+domains+effects baseline (b23a41).
  • 4.12+domains+effects+dlabs this PR (099a65).
  • 4.12+domains+effects+dlabs_prefetch this branch with this tree (f25633)

Here are results with 'big' instances on a detuned Zen2:
[Charts: execution time, speedup, and minor GCs for the 20210512 prefetch experiments]

The prefetching is having a beneficial effect. We even see a datapoint with binarytrees where dlabs outperformed at high core counts, offset against it underperforming in the same benchmark at lower core counts without prefetching. Worth noting that the baseline does not prefetch on its minor heap, though it is unclear whether that matters.

(NB: sandmark is slightly updated in this run vs the previous for matrix_multiplication and floyd_warshall; apologies that it is not fixed code there)


ctk21 commented May 21, 2021

We are closing this PR. However we have summarised where we got to here.
