Domain local allocation buffers#508
Domain local allocation buffers#508sadiqj wants to merge 27 commits intoocaml-multicore:4.12+domains+effectsfrom
Conversation
ff04e58 to
3065b2e
Compare
abbysmal
left a comment
There was a problem hiding this comment.
Thanks for this work!
I think the changes are fine, I found a thing or two that looks weird, but overall the core logic looks sane to me.
The minor heap setting bits are a bit tricky, I will address these as I take over the branch. (and I think the testsuite deserves more coverage for these)
Naming wise, I think things make sense now, there's a few config options we may want to change, I will take care of this later as well.
runtime/minor_gc.c
Outdated
| // Calculate the size of the existing mapping | ||
| int global_minor_heap_size = caml_global_minor_heap_limit - caml_global_minor_heap_start; | ||
| // Now using the number of participating domains, we calculate the new size | ||
| int new_global_minor_heap_size = participating_domains*Bsize_wsize(global_minor_heap_wsz_per_domain) + caml_global_minor_heap_start; |
There was a problem hiding this comment.
I think adding caml_global_minor_heap_start wasn't meant here
It just works because everything is page aligned (and thus, is effectively adding zero to the number).
runtime/minor_gc.c
Outdated
| } | ||
| global_minor_heap_wsz_per_domain = wsize; | ||
|
|
||
| if (domain_state->young_ptr != domain_state->young_end) caml_minor_collection (); |
There was a problem hiding this comment.
I think we should always trigger a minor collection here, there are conditions where a minor collection won't be hit on next gc poll
runtime/domain.c
Outdated
| /* setting young_limit and young_ptr to minor_heaps_base | ||
| to trigger minor_heaps reallocation on GC poll */ | ||
| domain_state->young_start = (char*)caml_global_minor_heap_start; | ||
| domain_state->young_end = (char*)caml_global_minor_heap_start; |
There was a problem hiding this comment.
the minor_heaps_base naming in the comment is now out of date
| CAMLextern void caml_minor_collection (void); | ||
| CAMLextern asize_t global_minor_heap_wsz_per_domain; | ||
|
|
||
| #ifdef CAML_INTERNALS |
There was a problem hiding this comment.
Is the removal of this CAML_INTERNALS guard intentional?
Aren't some of these functions meant to be guarded to not be visible elsewhere.
runtime/domain.c
Outdated
| goto reallocate_minor_heap_failure; | ||
| /* setting young_limit and young_ptr to minor_heaps_base | ||
| to trigger minor_heaps reallocation on GC poll */ | ||
| domain_state->young_start = (char*)caml_global_minor_heap_start; |
There was a problem hiding this comment.
This function create_domain can execute in parallel with STW segments but the domain is not yet part of the STW. It will be brittle to read caml_global_minor_heap_start multiple times should we ever change how that variable is assigned.
Is there any reason to read this variable at all? Can't we just set all the young_* values to a suitable null value and trigger the collection we need when the values are used?
runtime/domain.c
Outdated
|
|
||
| young_limit = atomic_load_acq((atomic_uintnat*)&domain_state->young_limit); | ||
| if( young_limit != INTERRUPT_MAGIC ) { | ||
| atomic_compare_exchange_strong((atomic_uintnat*)&domain_state->young_limit, &young_limit, caml_global_minor_heap_start); |
There was a problem hiding this comment.
I think this snippet to update the young_limit should use caml_update_young_limit
runtime/domain.c
Outdated
| // Check if our minor heap is full. If it is then we need to try to grab a | ||
| // new one from the global minor heap. | ||
| if( domain_minor_heap_full && !need_minor_gc ) { | ||
| uintnat global_ptr = |
There was a problem hiding this comment.
Is there a reason why we need to do this check with the global_ptr? We have to deal with that case inside caml_replenish_minor_heap already.
I didn't understand why the body of this if block wasn't just:
/* try to get a new minor heap */
if(!caml_replenish_minor_heap()) {
need_minor_gc = 1;
}There was a problem hiding this comment.
Good catch, this case can be simplified indeed.
runtime/domain.c
Outdated
| static void caml_poll_gc_work() | ||
| { | ||
| CAMLalloc_point_here; | ||
| { // No GC in this block |
There was a problem hiding this comment.
I didn't understand this comment, as there is a call to caml_empty_minor_heaps_once inside the block!
There was a problem hiding this comment.
I'm unsure as well, I think it is only meant for the first part of this function? (and a Gc should only be triggered after this preambule, through the need_minor_gc flag?) @sadiqj
There was a problem hiding this comment.
I've read that comment a few times and I don't understand what it means. Suggest we get rid of it.
runtime/minor_gc.c
Outdated
| uintnat cached_global_minor_heap_ptr; | ||
| uintnat new_alloc_ptr; | ||
|
|
||
| uintnat minor_buffer_wsize = global_minor_heap_wsz_per_domain >> 3; |
There was a problem hiding this comment.
I had a concern here that a user might somehow set the minor heap size small and then we would have a minor heap that couldn't hold even the smallest allocation. Do we need a guard here or in the initialization of global_minor_heap_wsz_per_domain?
| caml_update_young_limit(cached_global_minor_heap_ptr); | ||
|
|
||
| domain_state->young_start = (char*)cached_global_minor_heap_ptr; | ||
| domain_state->young_end = (char*)(cached_global_minor_heap_ptr + minor_buffer_bsize); |
There was a problem hiding this comment.
Is this minor_buffer_bsize really correct?
Aren't there code paths here where the requested_buffer_size is not equivalent to minor_buffer_bsize?
runtime/minor_gc.c
Outdated
| domain->state->id, | ||
| 100.0 * (double)st.live_bytes / (double)minor_allocated_bytes, | ||
| (unsigned)(minor_allocated_bytes + 512)/1024, rewrite_successes, rewrite_failures); | ||
| 100.0 * (double)st.live_bytes / (double)(domain_state->stat_minor_words), |
There was a problem hiding this comment.
I don't understand this change from minor_allocated_bytes to stat_minor_words.
The change makes the fraction in units bytes/words, rather than dimensionless; i.e. wouldn't we need to alter st.live_bytes to be live words.
The second change probably needs us to alter KB live to K words live.
There was a problem hiding this comment.
This change is to take into account that now, we may have used more minor heap space than just one segment (the one currently set to the domain.)
If the domain cycled through many minor heap segment with DLABs, we keep track of these in the replenish cycle: every time we are done doing replenish, we set stat_minor_words to += the previous minor heap segment.
This effectively allow us to keep track of the previous minor heap segments used by this domain.
However it means we do need to use stat_minor_words instead of the more local minor_allocated_bytes (that only knows of the current minor heap segment.).
I think a simpler fix would be to change (domain_state->stat_minor_words) to Bsize_wsize(domain_state->stat_minor_words) ?
Co-authored-by: Engil <decorne.en@gmail.com>
…ot + minor_buffer_bsize
…gment at least, fail otherwise
…t to force gc poll on next allocation path if replenish failed
… ending a minor cycle and at domain startup
…l take care of knowing if there's enough space or not.
…ion in gc_set, add gc_set test back
…ats as bytes, not words
…d be checked against the current caml_global_minor_heap_limit instead.
…ze per domain should be < Minox_heap_max)
…e performaned on uintnat, not in
…> Minor_heap_max * Num_domains
|
Here's an excerpt of the bench results (ran by @shubhamkumar13 on godal (IIT machine)). |
|
Some experiments were conducted to see if the way we were handling the memory allocations of the minor heaps were responsible for the slowdowns on the big instances. The tags in the results below are:
The one idea not tried, but might be is to have a 'minor heap stealing' strategy that would be more like HPC 'workstealing':
It does beg the question as to if using non-local minor heap space is ever better on larger machines than just having an early collection. |
|
I have run the benchmarks with a prefetching experiment:
Here are results with 'big' instances on a detuned Zen2: The prefetching is having a beneficial effect. We even see a datapoint with binarytrees where dlabs has out performed at high core counts, offset against it underperforming in the same benchmark at lower core counts without prefetching. Worth noting that the baseline does not perform prefetching on its minor heap, although unclear if that is important. (NB: sandmark is slightly updated in this run vs the previous for matrix_multiplication and floyd_warshall; apologies that it is not fixed code there) |
|
We are closing this PR. However we have summarised where we got to here. |











This PR extends #484
There is currently one failing test locally,
timing.ml. I'm still trying to figure out what's going on with that.