best-fit allocator #8809

Merged
damiendoligez merged 22 commits into ocaml:trunk from damiendoligez:best-fit
Oct 15, 2019

Conversation

@damiendoligez
Member

Add a best-fit allocator, which helps a lot with fragmentation, according to some testing done at Jane Street on memory-hungry programs.

Notes for the reviewer:

The main change is freelist.c. It now has three main sections:

  • next-fit allocator
  • first-fit allocator
  • best-fit allocator

The first two are the old allocator, split into two separate allocators. Reading these sections should be done in parallel with two copies of the old code (git only does the diff with the first, of course). The third is entirely new code.

I had to change the API of caml_fl_merge_block to avoid the cost of repeated coalescing in the best-fit case.
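For orientation, the three runtime-selectable policies can be pictured as a small dispatch table. This is only an illustrative C sketch (the identifier names are hypothetical, not the actual freelist.c symbols); the policy numbers match what Gc.allocation_policy exposes: 0 = next-fit (the old default), 1 = first-fit, 2 = best-fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: each policy gets its own allocation entry point,
   selected through a function pointer, mirroring the three sections of
   freelist.c. The bodies are stubs. */
typedef char *(*alloc_fn)(size_t wosize);

static char *nf_allocate(size_t wosize) { (void) wosize; return NULL; } /* next-fit  */
static char *ff_allocate(size_t wosize) { (void) wosize; return NULL; } /* first-fit */
static char *bf_allocate(size_t wosize) { (void) wosize; return NULL; } /* best-fit  */

/* Current allocation entry point; next-fit is the historical default. */
static alloc_fn fl_allocate = nf_allocate;

/* Policy numbers as exposed through Gc.allocation_policy. */
static void set_policy(int policy)
{
  switch (policy) {
  case 0: fl_allocate = nf_allocate; break;
  case 1: fl_allocate = ff_allocate; break;
  case 2: fl_allocate = bf_allocate; break;
  default: assert(0);
  }
}
```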

@alainfrisch
Contributor

That sounds great! Do you have examples of small programs that exhibit significant gains, or otherwise some intuition about allocation schemes likely to benefit from the new policy? Does the reduced fragmentation come with some higher GC cost? Do you expect that the new policy could become the default at some point, or is the overhead possibly too large in some common cases?

@bluddy
Contributor

bluddy commented Jul 17, 2019

Pretty sure we're trading performance for space -- the GC cost must be higher. It'd be nice to see a comparison.

@lpw25
Contributor

lpw25 commented Jul 17, 2019

I'll give some more practical numbers in a bit, but since you asked for small pathological examples, here's one:

let r = ref []
let r2 = ref [||]

let f () =
  (* Allocate 1 gb of mem, allocate some more and compact to make
     sure we have 1gb of mem + 80% gc overhead allocated right now. *)
  let len = 1024 * 1024 * 1024 / (Sys.word_size / 8) in
  let gb_of_mem = Array.init len (fun i -> Some i) in
  ignore (Array.copy gb_of_mem : _ array);
  Gc.compact ();
  let count = ref 0 in
  Printf.printf "initialized\n%!";
  while true; do
    (* And now: allocate continuously. Every 250 words of transient allocation on
       average, keep one word alive in [gb_of_mem] to create holes in any gap.  We do
       not increase live memory though, since we replace some other allocation from
       [r]. *)
    let conses = Random.int 500 / 3 in
    r2 := [||];
    for _ = 0 to conses - 1; do r := 1 :: !r; done;
    gb_of_mem.(Random.int len) <- Some !count;
    (Sys.opaque_identity r) := [];
    if !count mod 1500000 = 0 then
      let live_words = len * 3 (* 1 for the array + 2 for the Some *) in
      let stat = Gc.stat () in
      Printf.printf "chunks=%d, largest free: %.2fMW, heap_words=%.2fGW, live_fraction=%f%%\n%!"
        stat.heap_chunks
        (Float.of_int stat.largest_free *. 1e-6)
        (Float.of_int stat.heap_words *. 1e-9)
        (100. *. Float.of_int live_words /. Float.of_int stat.heap_words)
    else if !count mod 10000 = 5000 then
      (* from time to time, the program needs to do a big allocation, and because
         of all the fragmentation, the gc grows the heap *)
      r2 := Array.make 50_000 0;
    incr count;
  done;
;;

let () = f ()

I think this should grow its heap steadily until it eventually compacts with next fit, even though the number of live words is constant. With best fit it will stop growing the heap once the overhead matches the given space_overhead parameter.

I'm not sure that these small examples are the best way to evaluate these things, but you did ask. Judging from my results with larger programs, there are other pathological cases fixed by best-fit which don't look like this, but I don't have small examples of those to hand.

@lpw25
Contributor

lpw25 commented Jul 17, 2019

(The above example is due to @sliquister)

@alainfrisch
Contributor

Thanks @lpw25. I think this is useful to get an intuition on when this new policy can help. For such a program, I understand that fragmentation is better, and the heap will stop growing earlier than with other strategies. Do you have some idea of the runtime overhead due to the new policy (as long as the heap remains reasonable with former policies)?

@lpw25
Contributor

lpw25 commented Jul 17, 2019

Mostly I haven't seen an observable overhead.

A couple of programs had some small overhead (1-2%) which seemed to just be a product of having a smaller heap. Such overheads were easily compensated for by just relaxing space_overhead a bit -- which you can do since you are using less heap now.

I think another way to look at the above is that the current GC heuristics for ensuring that space_overhead is respected assume that there is no fragmentation. Since there is fragmentation, we are usually getting more overhead than space_overhead -- so a setting of 80% is secretly more like a setting of 100%. When best-fit reduces fragmentation, it makes the space_overhead heuristics more accurate -- so to keep equivalent behaviour you should increase the value of the space_overhead parameter.
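The arithmetic behind that remark can be made concrete with a toy calculation (the figures here are illustrative, not measurements from any benchmark):

```c
#include <assert.h>

/* Effective GC overhead seen by the process: (heap - live) / live, where
   the heap holds the GC-managed free space *plus* fragmentation holes.
   All quantities in the same unit (say, megawords). */
static double effective_overhead(double live, double space_overhead,
                                 double frag_holes)
{
  double gc_free = live * space_overhead;   /* free space the GC aims for */
  double heap = live + gc_free + frag_holes;
  return (heap - live) / live;
}
```

With 100 MW live, space_overhead at 80% and 20 MW lost to holes, the effective overhead is (80 + 20) / 100 = 100%: a setting of 80% behaving like a setting of 100%, as described above.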

In some cases there have been notable improvements in execution times. One real program had a more than 50% reduction in execution time with best-fit because it was hitting a pathological case in the next-fit allocator that led to it spending 65% of its time traversing the free list.

@stedolan
Contributor

I don't have time this evening, but I'll read this properly next week. It looks great - the splay tree of lists gives a much more efficient search than the current allocator. I haven't read the code in detail yet, but I'm curious about the design and allocation policy for small allocations. (For large allocations, the policy seems very simple - precise best-fit - and the implementation is really nice).

The preallocation mechanism, if I understand it correctly, allocates space for 100 small objects when the first allocation is requested so that the next 99 allocations are fast. This happens if there is no currently available small slot of the correct size or greater. So, if I allocate 500 cons cells, I only do 5 accesses to the splay tree.

However, if I allocate a single 16-word object before my 1000 cons cells, then I preallocate 1700 words of space and none of my 500 subsequent cons-cell allocations hit the preallocation fast-path, all hitting the bf_split_small/bf_insert_fragment_small path instead. I'm not sure how much slower this is, but it seems a bit unfortunate.
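Assuming a one-word header per free block, the arithmetic in the comment above works out as follows (PREALLOC_COUNT mirrors the 100-block figure mentioned earlier; the helper names are made up for illustration):

```c
#include <assert.h>

#define PREALLOC_COUNT 100  /* blocks preallocated on a small-size miss */
#define HEADER_WORDS 1      /* assumed one-word block header */

/* Words reserved when preallocating PREALLOC_COUNT blocks of a given
   wosize: 100 blocks of wosize 16 -> 100 * (16 + 1) = 1700 words. */
static int prealloc_words(int wosize)
{
  return PREALLOC_COUNT * (wosize + HEADER_WORDS);
}

/* Splay-tree accesses for a run of same-size small allocations: one per
   refill of PREALLOC_COUNT blocks, so 500 cons cells -> 5 accesses. */
static int tree_accesses(int allocations)
{
  return (allocations + PREALLOC_COUNT - 1) / PREALLOC_COUNT;
}
```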

As an alternative, might it be better to avoid ever splitting small blocks, and only merging them if the result of the merge is bigger than BF_NUM_SMALL? That way, we'd hit the fast preallocated path in this case as well. (I imagine there's some horrible pathological case here. It might be that the current policy is better, but this is the part of the design that I understand least).

Also, have you put any thought into how this design might work with multicore?

@bluddy
Contributor

bluddy commented Jul 17, 2019

I guess that makes sense. If you have fragmentation, not only are you traversing the free list for longer, you're also accessing fresh memory more often, which could reduce cache efficiency, which would dominate other performance considerations.

}
Next_small (Val_hp (p)) = Val_NULL;
/* Put the chain in the free list. */
bf_small_fl[wosz].free = Next_small (Val_hp (big));
Contributor


This assumes that bf_small_fl[wosz] is currently empty. However, the call to bf_allocate_from_tree above on line 1533 could have added a node to it. This leaks a node and results in an assertion failure with the debug runtime.

Member Author


Good catch indeed!

Do you have a simple repro case? I have a fix but I'd like to test it.

Contributor

@lpw25 lpw25 Aug 2, 2019


Afraid not. I think it is causing a segfault in this benchmark about 1 time in 20. I'm just checking at the moment that this is indeed the cause.

@damiendoligez
Member Author

A few answers before I go on vacation:

@alainfrisch

some intuition about allocation schemes likely to benefit from the new policy?

Our experience so far seems to show that all programs with large heaps will benefit to some extent. In normal cases, because the program fits in a noticeably smaller heap; in pathological cases because the compaction will trigger much less often.

Does the reduced fragmentation comes with some higher GC cost?

The cost of maintaining and using the best-fit data structures doesn't seem to be higher than the simple next-fit free-list. But the space-time trade-off of the GC is actually controlled by the GC speed feedback loop, and when your heap is smaller (for example because of reduced fragmentation) it will spend more time in the GC. You can then recover your previous speed by increasing the space_overhead parameter of the GC.

Do you expect that the new policy could become the default at some point, or is the overhead possibly too large in some common cases?

Yes, I expect to switch the default to best-fit, as the overhead is (as far as we can tell) negligible.

@stedolan

However, if I allocate a single 16-word object before my 1000 cons cells, then I preallocate 1700 words of space and none of my 500 subsequent cons-cell allocations hit the preallocation fast-path, all hitting the bf_split_small/bf_insert_fragment_small path instead. I'm not sure how much slower this is, but it seems a bit unfortunate.

We could partly alleviate this problem by pre-allocating a fixed number of words (instead of blocks) so the list of 16-word blocks would be much shorter.

As an alternative, might it be better to avoid ever splitting small blocks, and only merging them if the result of the merge is bigger than BF_NUM_SMALL? That way, we'd hit the fast preallocated path in this case as well. (I imagine there's some horrible pathological case here. It might be that the current policy is better, but this is the part of the design that I understand least).

I'm really wary of deviating too much from a strict best-fit policy because it's hard to tell what will happen in real programs. For example, Wilson et al.[1] observe that many programs have multiple phases during which they allocate some set of sizes, so it's expected that a program will allocate lots of 16-word blocks for some time, then suddenly switch to 2-word blocks. If you never split small blocks, you'll be stuck with a long list of 16-word blocks.

Also, have you put any thought into how this design might work with multicore?

The small lists can be per-domain, and the splay tree will need locking. I'm worried about the contention on that lock. Maybe it can be reduced by splaying less often and using a single-writer/multiple-readers lock. If that doesn't work, we'll need to switch to a different data structure (and maybe a relaxed policy).

@lpw25
Contributor

lpw25 commented Aug 2, 2019

I'm worried about the contention on that lock. Maybe it can be reduced by splaying less often and using a single-writer/multiple-readers lock.

Note to our future selves: there are approaches like "Lazy Splaying" which apparently reduce contention in concurrent cases whilst still preserving good performance on unbalanced workloads, so we might want to look at them if this does turn out to be an issue.

and allocate directly from the smallest block of the tree.

This removes the need for pre-loading the small free lists.
@damiendoligez
Member Author

@lpw25 @stedolan This is the version with direct allocation in the smallest heap block. Also available for 4.08 at https://github.com/damiendoligez/ocaml/archive/4.08.1+best-fit.tar.gz

@alainfrisch
Contributor

We have a program where repeatedly executing some action led to memory growing significantly each time (to more than 1.5Gb) with next-fit, but it remains reasonable (at around 500Mb) with first-fit; unfortunately, runtime is almost tripled with first-fit. Is that the kind of slowdown expected with first-fit, and is there some hope that the new policy would bring the best of both worlds? (I suspect that fragmentation is due to the use of demarshaling -- I wonder whether demarshaling into small blocks wouldn't be better than the current strategy performance-wise in some cases; more GC-related overhead during demarshaling but less fragmentation...)

@alainfrisch
Contributor

Answering to myself, with the help of @nojb who cherry-picked the new policy on our local version: the new policy looks terrific for our use case!

Some results running repeatedly some heavy calculation, monitoring the time taken in each run and the process memory usage after it, with the three policies:

          policy 0       policy 1       policy 2
1st run   28s / 855Mb    47s / 566Mb    27s / 545Mb
2nd run   21s / 1164Mb   57s / 584Mb    22s / 563Mb
3rd run   21s / 1316Mb   59s / 606Mb    23s / 572Mb

The new policy is only a tiny bit slower than policy 0 (perhaps in the noise), and even better than policy 1 in terms of memory usage.

With another payload, where policy 1 was only slightly slower than policy 0 but clearly better in terms of memory usage, I observe a similar conclusion: policy 2 is again the best in terms of memory usage, and sits between 0 and 1 in terms of runtime.

@stedolan
Contributor

This looks great! I have a couple of notes, but most of them are about comments.

@lpw25
Contributor

lpw25 commented Sep 16, 2019

I've been running the latest version with the debug runtime on all our tests and I'm getting an assertion failure:

file gc_ctrl.c; line 168 ### Assertion failed: prev_hp == NULL || Color_hp (prev_hp) != Caml_blue || cur_hp == (header_t *) caml_gc_sweep_hp

which seems to indicate that there are fragments preceded by blue blocks on the heap, which I did not think was supposed to happen with the latest version.

@gasche
Member

gasche commented Oct 8, 2019

The org-mode slides of Damien's talk are now available, as well as the live notes I took during the talk.

@stedolan
Contributor

stedolan commented Oct 8, 2019

Alternatively, we can check when coalescing that the block pointed by caml_fl_merge is still blue.

Good idea! Here's an alternative patch along those lines:

diff --git a/runtime/freelist.c b/runtime/freelist.c
index 6ddc44db37..26f4fc8644 100644
--- a/runtime/freelist.c
+++ b/runtime/freelist.c
@@ -1669,9 +1669,9 @@ static header_t *bf_merge_block (value bp, char *limit)
 
   CAMLassert (Color_val (bp) == Caml_white);
   /* Find the starting point of the current run of free blocks. */
-  if (caml_fl_merge != Val_NULL && Next_in_mem (caml_fl_merge) == bp){
+  if (caml_fl_merge != Val_NULL && Next_in_mem (caml_fl_merge) == bp &&
+      Color_val(caml_fl_merge) == Caml_blue){
     start = caml_fl_merge;
-    CAMLassert (Color_val (start) == Caml_blue);
     bf_remove (start);
   }else{
     start = bp;

@damiendoligez
Member Author

I like this version; it seems more robust and probably has less run-time overhead. It relies on the property that bf_merge_block is the only function that can turn a header into a non-header (and it will never do that to the header under caml_fl_merge).

@damiendoligez
Member Author

damiendoligez commented Oct 8, 2019

In order to satisfy the theoretical guarantees provided by splay trees, we have to splay at every single access, including insertions and removals. Basically, each time we follow a left or right edge, we have to consider splaying it.

A few hours discussing this with @jhjourdan yielded a way of doing that without re-splaying, for all the needed tree functions. It will be a rather large change for a better worst-case guarantee but probably no noticeable speed gain. We've decided to postpone it to 4.11.
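For readers following along, the core tree operation under discussion is "find the smallest free-block size >= the request". Here is a plain (non-splaying) BST sketch of just that query; the actual freelist.c structure is a splay tree whose nodes each hold a list of blocks of one size, so this is only an illustration of the policy, not of the implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One node per distinct free-block size; in the real allocator each node
   would also carry the list of free blocks of that size. */
typedef struct node {
  size_t size;                /* block size, used as the BST key */
  struct node *left, *right;
} node;

static node *insert(node *t, size_t size)
{
  if (t == NULL) {
    node *n = malloc(sizeof *n);
    n->size = size;
    n->left = n->right = NULL;
    return n;
  }
  if (size < t->size) t->left = insert(t->left, size);
  else if (size > t->size) t->right = insert(t->right, size);
  return t;
}

/* Best fit: the node with the smallest size >= request, or NULL if every
   bucket is too small. */
static node *best_fit(node *t, size_t request)
{
  node *best = NULL;
  while (t != NULL) {
    if (t->size >= request) { best = t; t = t->left; }
    else t = t->right;
  }
  return best;
}
```

With bucket sizes 3, 8 and 20 in the tree, a request for 5 words lands on the 8-word bucket; a request larger than every bucket returns NULL, which would typically mean growing the heap.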

lpw25 pushed a commit to janestreet/ocaml that referenced this pull request Oct 8, 2019
@damiendoligez
Member Author

Another remark: some C bindings (e.g., Lablgtk) depend on disabling the compactor because they have trouble registering all their roots. I guess this would mean that they are incompatible with best-fit, because changing the policy requires a compaction.

FTR: you can choose the initial policy with the OCAMLRUNPARAM environment variable.

damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Oct 11, 2019
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Oct 11, 2019
@damiendoligez damiendoligez merged commit 01bdd5b into ocaml:trunk Oct 15, 2019
@gasche
Member

gasche commented Oct 15, 2019

Congratulations to everyone involved, this is a big, nice change. We have to make sure during the release process that heavy users know to try this new policy, and that they come back and report to us about improvements and, more importantly, any potential regressions.
