best-fit allocator #8809

Merged
damiendoligez merged 22 commits into ocaml:trunk from damiendoligez:best-fit
Oct 15, 2019

Conversation

@damiendoligez
Member

Add a best-fit allocator, which helps a lot with fragmentation, according to some testing done at Jane Street on memory-hungry programs.

Notes for the reviewer:

The main change is freelist.c. It now has three main sections:

  • next-fit allocator
  • first-fit allocator
  • best-fit allocator

The first two are the old allocator, split into two separate allocators. Reading these sections should be done in parallel with two copies of the old code (git only does the diff with the first, of course). The third is entirely new code.

I had to change the API of caml_fl_merge_block to avoid the cost of repeated coalescing in the best-fit case.
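For orientation, the three runtime-selectable policies can be pictured as a small dispatch table. This is only an illustrative C sketch (the identifier names are hypothetical, not the actual freelist.c symbols); the policy numbers match what Gc.allocation_policy exposes: 0 = next-fit (the old default), 1 = first-fit, 2 = best-fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: each policy gets its own allocation entry point,
   selected through a function pointer, mirroring the three sections of
   freelist.c. The bodies are stubs. */
typedef char *(*alloc_fn)(size_t wosize);

static char *nf_allocate(size_t wosize) { (void) wosize; return NULL; } /* next-fit  */
static char *ff_allocate(size_t wosize) { (void) wosize; return NULL; } /* first-fit */
static char *bf_allocate(size_t wosize) { (void) wosize; return NULL; } /* best-fit  */

/* Current allocation entry point; next-fit is the historical default. */
static alloc_fn fl_allocate = nf_allocate;

/* Policy numbers as exposed through Gc.allocation_policy. */
static void set_policy(int policy)
{
  switch (policy) {
  case 0: fl_allocate = nf_allocate; break;
  case 1: fl_allocate = ff_allocate; break;
  case 2: fl_allocate = bf_allocate; break;
  default: assert(0);
  }
}
```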

@alainfrisch
Contributor

That sounds great! Do you have examples of small programs that exhibit significant gains, or otherwise some intuition about allocation schemes likely to benefit from the new policy? Does the reduced fragmentation come with some higher GC cost? Do you expect that the new policy could become the default at some point, or is the overhead possibly too large in some common cases?

@bluddy
Contributor

bluddy commented Jul 17, 2019

Pretty sure we're trading performance for space -- the GC cost must be higher. It'd be nice to see a comparison.

@lpw25
Contributor

lpw25 commented Jul 17, 2019

I'll give some more practical numbers in a bit, but since you asked for small pathological examples, here's one:

let r = ref []
let r2 = ref [||]

let f () =
  (* Allocate 1 gb of mem, allocate some more and compact to make
     sure we have 1gb of mem + 80% gc overhead allocated right now. *)
  let len = 1024 * 1024 * 1024 / (Sys.word_size / 8) in
  let gb_of_mem = Array.init len (fun i -> Some i) in
  ignore (Array.copy gb_of_mem : _ array);
  Gc.compact ();
  let count = ref 0 in
  Printf.printf "initialized\n%!";
  while true; do
    (* And now: allocate continuously. Every 250 words of transient allocation on
       average, keep one word alive in [gb_of_mem] to create holes in any gap.  We do
       not increase live memory though, since we replace some other allocation from
       [r]. *)
    let conses = Random.int 500 / 3 in
    r2 := [||];
    for _ = 0 to conses - 1; do r := 1 :: !r; done;
    gb_of_mem.(Random.int len) <- Some !count;
    (Sys.opaque_identity r) := [];
    if !count mod 1500000 = 0 then
      let live_words = len * 3 (* 1 for the array + 2 for the Some *) in
      let stat = Gc.stat () in
      Printf.printf "chunks=%d, largest free: %.2fMW, heap_words=%.2fGW, live_fraction=%f%%\n%!"
        stat.heap_chunks
        (Float.of_int stat.largest_free *. 1e-6)
        (Float.of_int stat.heap_words *. 1e-9)
        (100. *. Float.of_int live_words /. Float.of_int stat.heap_words)
    else if !count mod 10000 = 5000 then
      (* from time to time, the program needs to do a big allocation, and because
         of all the fragmentation, the gc grows the heap *)
      r2 := Array.make 50_000 0;
    incr count;
  done;
;;

let () = f ()

I think this should grow its heap steadily until it eventually compacts with next fit, even though the number of live words is constant. With best fit it will stop growing the heap once the overhead matches the given space_overhead parameter.

I'm not sure that these small examples are the best way to evaluate these things, but you did ask. Judging from my results with larger programs, there are other pathological cases fixed by best-fit which don't look like this, but I don't have small examples of those to hand.

@lpw25
Contributor

lpw25 commented Jul 17, 2019

(The above example is due to @sliquister)

@alainfrisch
Contributor

Thanks @lpw25. I think this is useful to get an intuition on when this new policy can help. For such a program, I understand that fragmentation is better, and the heap will stop growing earlier than with other strategies. Do you have some idea of the runtime overhead due to the new policy (as long as the heap remains reasonable with former policies)?

@lpw25
Contributor

lpw25 commented Jul 17, 2019

Mostly I haven't seen an observable overhead.

A couple of programs had some small overhead (1-2%) which seemed to just be a product of having a smaller heap. Such overheads were easily compensated for by just relaxing space_overhead a bit -- which you can do since you are using less heap now.

I think another way to look at the above is that the current GC heuristics for ensuring that space_overhead is respected assume that there is no fragmentation. Since there is fragmentation, we are usually getting more overhead than space_overhead -- so a setting of 80% is secretly more like a setting of 100%. When best-fit reduces fragmentation, it makes the space_overhead heuristics more accurate -- so to keep equivalent behaviour you should increase the value of the space_overhead parameter.
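The arithmetic behind that remark can be made concrete with a toy calculation (the figures here are illustrative, not measurements from any benchmark):

```c
#include <assert.h>

/* Effective GC overhead seen by the process: (heap - live) / live, where
   the heap holds the GC-managed free space *plus* fragmentation holes.
   All quantities in the same unit (say, megawords). */
static double effective_overhead(double live, double space_overhead,
                                 double frag_holes)
{
  double gc_free = live * space_overhead;   /* free space the GC aims for */
  double heap = live + gc_free + frag_holes;
  return (heap - live) / live;
}
```

With 100 MW live, space_overhead at 80% and 20 MW lost to holes, the effective overhead is (80 + 20) / 100 = 100%: a setting of 80% behaving like a setting of 100%, as described above.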

In some cases there have been notable improvements in execution times. One real program had a more than 50% reduction in execution time with best-fit because it was hitting a pathological case in the next-fit allocator that led to it spending 65% of its time traversing the free list.

@stedolan
Contributor

I don't have time this evening, but I'll read this properly next week. It looks great - the splay tree of lists gives a much more efficient search than the current allocator. I haven't read the code in detail yet, but I'm curious about the design and allocation policy for small allocations. (For large allocations, the policy seems very simple - precise best-fit - and the implementation is really nice).

The preallocation mechanism, if I understand it correctly, allocates space for 100 small objects when the first allocation is requested so that the next 99 allocations are fast. This happens if there is no currently available small slot of the correct size or greater. So, if I allocate 500 cons cells, I only do 5 accesses to the splay tree.

However, if I allocate a single 16-word object before my 1000 cons cells, then I preallocate 1700 words of space and none of my 500 subsequent cons-cell allocations hit the preallocation fast-path, all hitting the bf_split_small/bf_insert_fragment_small path instead. I'm not sure how much slower this is, but it seems a bit unfortunate.
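Assuming a one-word header per free block, the arithmetic in the comment above works out as follows (PREALLOC_COUNT mirrors the 100-block figure mentioned earlier; the helper names are made up for illustration):

```c
#include <assert.h>

#define PREALLOC_COUNT 100  /* blocks preallocated on a small-size miss */
#define HEADER_WORDS 1      /* assumed one-word block header */

/* Words reserved when preallocating PREALLOC_COUNT blocks of a given
   wosize: 100 blocks of wosize 16 -> 100 * (16 + 1) = 1700 words. */
static int prealloc_words(int wosize)
{
  return PREALLOC_COUNT * (wosize + HEADER_WORDS);
}

/* Splay-tree accesses for a run of same-size small allocations: one per
   refill of PREALLOC_COUNT blocks, so 500 cons cells -> 5 accesses. */
static int tree_accesses(int allocations)
{
  return (allocations + PREALLOC_COUNT - 1) / PREALLOC_COUNT;
}
```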

As an alternative, might it be better to avoid ever splitting small blocks, and only merging them if the result of the merge is bigger than BF_NUM_SMALL? That way, we'd hit the fast preallocated path in this case as well. (I imagine there's some horrible pathological case here. It might be that the current policy is better, but this is the part of the design that I understand least).

Also, have you put any thought into how this design might work with multicore?

@bluddy
Contributor

bluddy commented Jul 17, 2019

I guess that makes sense. If you have fragmentation, not only are you traversing the free list for longer, you're also accessing fresh memory more often, which could reduce cache efficiency, which would dominate other performance considerations.

}
Next_small (Val_hp (p)) = Val_NULL;
/* Put the chain in the free list. */
bf_small_fl[wosz].free = Next_small (Val_hp (big));
Contributor


This assumes that bf_small_fl[wosz] is currently empty. However, the call to bf_allocate_from_tree above on line 1533 could have added a node to it. This leaks a node and results in an assertion failure with the debug runtime.

Member Author


Good catch indeed!

Do you have a simple repro case? I have a fix but I'd like to test it.

Contributor

@lpw25 lpw25 Aug 2, 2019


Afraid not. I think it is causing a segfault in this benchmark about 1 time in 20. I'm just checking at the moment that this is indeed the cause.

@damiendoligez
Member Author

A few answers before I go on vacation:

@alainfrisch

some intuition about allocation schemes likely to benefit from the new policy?

Our experience so far seems to show that all programs with large heaps will benefit to some extent. In normal cases, because the program fits in a noticeably smaller heap; in pathological cases because the compaction will trigger much less often.

Does the reduced fragmentation comes with some higher GC cost?

The cost of maintaining and using the best-fit data structures doesn't seem to be higher than the simple next-fit free-list. But the space-time trade-off of the GC is actually controlled by the GC speed feedback loop, and when your heap is smaller (for example because of reduced fragmentation) it will spend more time in the GC. You can then recover your previous speed by increasing the space_overhead parameter of the GC.

Do you expect that the new policy could become the default at some point, or is the overhead possibly too large in some common cases?

Yes, I expect to switch the default to best-fit, as the overhead is (as far as we can tell) negligible.

@stedolan

However, if I allocate a single 16-word object before my 1000 cons cells, then I preallocate 1700 words of space and none of my 500 subsequent cons-cell allocations hit the preallocation fast-path, all hitting the bf_split_small/bf_insert_fragment_small path instead. I'm not sure how much slower this is, but it seems a bit unfortunate.

We could partly alleviate this problem by pre-allocating a fixed number of words (instead of blocks) so the list of 16-word blocks would be much shorter.

As an alternative, might it be better to avoid ever splitting small blocks, and only merging them if the result of the merge is bigger than BF_NUM_SMALL? That way, we'd hit the fast preallocated path in this case as well. (I imagine there's some horrible pathological case here. It might be that the current policy is better, but this is the part of the design that I understand least).

I'm really wary of deviating too much from a strict best-fit policy because it's hard to tell what will happen in real programs. For example, Wilson et al.[1] observe that many programs have multiple phases during which they allocate some set of sizes, so it's expected that a program will allocate lots of 16-word blocks for some time, then suddenly switch to 2-word blocks. If you never split small blocks, you'll be stuck with a long list of 16-word blocks.

Also, have you put any thought into how this design might work with multicore?

The small lists can be per-domain, and the splay tree will need locking. I'm worried about the contention on that lock. Maybe it can be reduced by splaying less often and using a single-writer/multiple-readers lock. If that doesn't work, we'll need to switch to a different data structure (and maybe a relaxed policy).

@lpw25
Contributor

lpw25 commented Aug 2, 2019

I'm worried about the contention on that lock. Maybe it can be reduced by splaying less often and using a single-writer/multiple-readers lock.

Note to our future selves: there are approaches like "Lazy Splaying" which apparently reduce contention in concurrent cases whilst still preserving good performance on unbalanced workloads, so we might want to look at them if this does turn out to be an issue.

and allocate directly from the smallest block of the tree.

This removes the need for pre-loading the small free lists.
@damiendoligez
Member Author

@lpw25 @stedolan This is the version with direct allocation in the smallest heap block. Also available for 4.08 at https://github.com/damiendoligez/ocaml/archive/4.08.1+best-fit.tar.gz

@alainfrisch
Contributor

We have a program where repeatedly executing some action led to memory growing significantly each time (to more than 1.5Gb) with next-fit, but it remains reasonable (at around 500Mb) with first-fit; unfortunately, runtime is almost tripled with first-fit. Is that the kind of slowdown expected with first-fit, and is there some hope that the new policy would bring the best of both worlds? (I suspect that fragmentation is due to the use of demarshaling -- I wonder whether demarshaling into small blocks wouldn't be better than the current strategy performance-wise in some cases; more GC-related overhead during demarshaling but less fragmentation...)

@alainfrisch
Contributor

Answering to myself, with the help of @nojb who cherry-picked the new policy on our local version: the new policy looks terrific for our use case!

Some results running repeatedly some heavy calculation, monitoring the time taken in each run and the process memory usage after it, with the three policies:

          policy 0       policy 1       policy 2
1st run   28s / 855Mb    47s / 566Mb    27s / 545Mb
2nd run   21s / 1164Mb   57s / 584Mb    22s / 563Mb
3rd run   21s / 1316Mb   59s / 606Mb    23s / 572Mb

The new policy is only a tiny bit slower than policy 0 (perhaps in the noise), and even better than policy 1 in terms of memory usage.

With another payload, where policy 1 was only slightly slower than policy 0 but clearly better in terms of memory usage, I observe a similar conclusion: policy 2 is again the best in terms of memory usage, and sits between 0 and 1 in terms of runtime.

@stedolan
Contributor

This looks great! I have a couple of notes, but most of them are about comments.

@lpw25
Contributor

lpw25 commented Sep 16, 2019

I've been running the latest version with the debug runtime on all our tests and I'm getting an assertion failure:

file gc_ctrl.c; line 168 ### Assertion failed: prev_hp == NULL || Color_hp (prev_hp) != Caml_blue || cur_hp == (header_t *) caml_gc_sweep_hp

which seems to indicate that there are fragments preceded by blue blocks on the heap, which I did not think was supposed to happen with the latest version.

@gasche
Member

gasche commented Oct 8, 2019

The org-mode slides of Damien's talk are now available, as well as the live notes I took during the talk.

@stedolan
Contributor

stedolan commented Oct 8, 2019

Alternatively, we can check when coalescing that the block pointed by caml_fl_merge is still blue.

Good idea! Here's an alternative patch along those lines:

diff --git a/runtime/freelist.c b/runtime/freelist.c
index 6ddc44db37..26f4fc8644 100644
--- a/runtime/freelist.c
+++ b/runtime/freelist.c
@@ -1669,9 +1669,9 @@ static header_t *bf_merge_block (value bp, char *limit)
 
   CAMLassert (Color_val (bp) == Caml_white);
   /* Find the starting point of the current run of free blocks. */
-  if (caml_fl_merge != Val_NULL && Next_in_mem (caml_fl_merge) == bp){
+  if (caml_fl_merge != Val_NULL && Next_in_mem (caml_fl_merge) == bp &&
+      Color_val(caml_fl_merge) == Caml_blue){
     start = caml_fl_merge;
-    CAMLassert (Color_val (start) == Caml_blue);
     bf_remove (start);
   }else{
     start = bp;

@damiendoligez
Member Author

I like this version; it seems more robust and probably has less run-time overhead. It relies on the property that bf_merge_block is the only function that can turn a header into a non-header (and it will never do that to the header under caml_fl_merge).

@damiendoligez
Member Author

damiendoligez commented Oct 8, 2019

In order to satisfy the theoretical guarantees provided by splay trees, we have to splay at every single access, including insertions and removals. Basically, each time we follow a left or right edge, we have to consider splaying it.

A few hours discussing this with @jhjourdan yielded a way of doing that without re-splaying, for all the needed tree functions. It will be a rather large change for a better worst-case guarantee but probably no noticeable speed gain. We've decided to postpone it to 4.11.
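For readers following along, the core tree operation under discussion is "find the smallest free-block size >= the request". Here is a plain (non-splaying) BST sketch of just that query; the actual freelist.c structure is a splay tree whose nodes each hold a list of blocks of one size, so this is only an illustration of the policy, not of the implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One node per distinct free-block size; in the real allocator each node
   would also carry the list of free blocks of that size. */
typedef struct node {
  size_t size;                /* block size, used as the BST key */
  struct node *left, *right;
} node;

static node *insert(node *t, size_t size)
{
  if (t == NULL) {
    node *n = malloc(sizeof *n);
    n->size = size;
    n->left = n->right = NULL;
    return n;
  }
  if (size < t->size) t->left = insert(t->left, size);
  else if (size > t->size) t->right = insert(t->right, size);
  return t;
}

/* Best fit: the node with the smallest size >= request, or NULL if every
   bucket is too small. */
static node *best_fit(node *t, size_t request)
{
  node *best = NULL;
  while (t != NULL) {
    if (t->size >= request) { best = t; t = t->left; }
    else t = t->right;
  }
  return best;
}
```

With bucket sizes 3, 8 and 20 in the tree, a request for 5 words lands on the 8-word bucket; a request larger than every bucket returns NULL, which would typically mean growing the heap.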

lpw25 pushed a commit to janestreet/ocaml that referenced this pull request Oct 8, 2019
@damiendoligez
Member Author

Another remark: some C bindings (e.g., Lablgtk) depend on disabling the compactor because they have trouble registering all their roots. I guess this would mean that they are incompatible with best-fit, because changing the policy requires a compaction.

FTR: you can choose the initial policy with the OCAMLRUNPARAM environment variable.

damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Oct 11, 2019
damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Oct 11, 2019
@damiendoligez damiendoligez merged commit 01bdd5b into ocaml:trunk Oct 15, 2019
@gasche
Member

gasche commented Oct 15, 2019

Congratulations to everyone involved, this is a big, nice change. We have to make sure during the release process that heavy users know to try this new policy, and that they come back and report to us about improvements and, more importantly, any potential regressions.
