Conversation
|
That sounds great! Do you have examples of small programs that exhibit significant gains, or otherwise some intuition about allocation schemes likely to benefit from the new policy? Does the reduced fragmentation come with a higher GC cost? Do you expect that the new policy could become the default at some point, or is the overhead possibly too large in some common cases? |
|
Pretty sure we're trading performance for space -- the GC cost must be higher. It'd be nice to see a comparison. |
|
I'll give some more practical numbers in a bit, but since you asked for small pathological examples, here's one:

```ocaml
let r = ref []
let r2 = ref [||]

let f () =
  (* Allocate 1 GB of memory, allocate some more and compact, to make
     sure we have 1 GB of memory + 80% GC overhead allocated right now. *)
  let len = 1024 * 1024 * 1024 / (Sys.word_size / 8) in
  let gb_of_mem = Array.init len (fun i -> Some i) in
  ignore (Array.copy gb_of_mem : _ array);
  Gc.compact ();
  let count = ref 0 in
  Printf.printf "initialized\n%!";
  while true do
    (* And now: allocate continuously.  Every 250 words of transient
       allocation on average, keep one word alive in [gb_of_mem] to create
       holes in any gap.  We do not increase live memory though, since we
       replace some other allocation from [r]. *)
    let conses = Random.int 500 / 3 in
    r2 := [||];
    for _ = 0 to conses - 1 do r := 1 :: !r done;
    gb_of_mem.(Random.int len) <- Some !count;
    (Sys.opaque_identity r) := [];
    if !count mod 1500000 = 0 then begin
      let live_words = len * 3 (* 1 for the array + 2 for the Some *) in
      let stat = Gc.stat () in
      Printf.printf
        "chunks=%d, largest free: %.2fMW, heap_words=%.2fGW, live_fraction=%f%%\n%!"
        stat.heap_chunks
        (Float.of_int stat.largest_free *. 1e-6)
        (Float.of_int stat.heap_words *. 1e-9)
        (100. *. Float.of_int live_words /. Float.of_int stat.heap_words)
    end else if !count mod 10000 = 5000 then
      (* From time to time, the program needs to do a big allocation, and
         because of all the fragmentation, the GC grows the heap. *)
      r2 := Array.make 50_000 0;
    incr count
  done

let () = f ()
```

I think this should grow its heap steadily with next-fit until it eventually compacts, even though the number of live words is constant. With best-fit it will stop growing the heap once the overhead matches the given space-overhead setting.

I'm not sure that these small examples are the best way to evaluate these things, but you did ask. Judging from my results with larger programs, there are other pathological cases fixed by best-fit which don't look like this, but I don't have small examples of those to hand. |
|
(The above example is due to @sliquister) |
|
Thanks @lpw25. I think this is useful to get an intuition on when this new policy can help. For such a program, I understand that fragmentation is better and the heap will stop growing earlier than with other strategies. Do you have some idea of the runtime overhead due to the new policy (as long as the heap remains reasonable with the former policies)? |
|
Mostly I haven't seen an observable overhead. A couple of programs had some small overhead (1-2%) which seemed to just be a product of having a smaller heap. Such overheads were easily compensated for by just relaxing the space-overhead setting.

I think another way to look at the above is that the GC's space-time trade-off is really governed by its space-overhead target: with a less fragmenting allocator the heap is smaller and the GC runs more often, and relaxing the target recovers the previous speed.

In some cases there have been notable improvements in execution times. One real program had a more than 50% reduction in execution time with best-fit because it was hitting a pathological case in the next-fit allocator that led to it spending 65% of its time traversing the free list. |
|
I don't have time this evening, but I'll read this properly next week. It looks great: the splay tree of lists gives a much more efficient search than the current allocator.

I haven't read the code in detail yet, but I'm curious about the design and allocation policy for small allocations. (For large allocations, the policy seems very simple, precise best-fit, and the implementation is really nice.) The preallocation mechanism, if I understand it correctly, allocates space for 100 small objects when the first allocation is requested, so that the next 99 allocations are fast. This happens if there is no currently available small slot of the correct size or greater. So, if I allocate 500 cons cells, I only do 5 accesses to the splay tree. However, if I allocate a single 16-word object before my 500 cons cells, then I preallocate 1700 words of space, and none of my 500 subsequent cons-cell allocations hit the preallocation fast path; they all go to the splay tree instead.

As an alternative, might it be better to avoid ever splitting small blocks, and to only merge them if the result of the merge is bigger than the largest small-block size?

Also, have you put any thought into how this design might work with multicore? |
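As a toy OCaml model of the preallocation scheme described above (this is my illustration, not the PR's C implementation; names and the block representation are made up): per-size free lists are refilled in batches of 100, so only every 100th allocation of a given size pays a tree lookup.

```ocaml
(* Toy model of size-class free lists with batched refill.  A "block"
   is represented by just its size; a real allocator would instead
   carve the batch out of a large region found via the splay tree. *)
let batch = 100
let max_small = 16
let free_lists : int list array = Array.make (max_small + 1) []
let tree_accesses = ref 0  (* counts simulated splay-tree lookups *)

let alloc sz =
  (match free_lists.(sz) with
   | [] ->
     (* Slow path: one tree access preallocates a batch of [sz]-word
        blocks, making the next [batch - 1] allocations fast. *)
     incr tree_accesses;
     free_lists.(sz) <- List.init batch (fun _ -> sz)
   | _ -> ());
  match free_lists.(sz) with
  | b :: rest -> free_lists.(sz) <- rest; b
  | [] -> assert false

let () =
  for _ = 1 to 500 do ignore (alloc 2) done;  (* 500 cons cells *)
  Printf.printf "tree accesses: %d\n" !tree_accesses  (* prints 5 *)
```

This reproduces the arithmetic in the comment above: 500 same-size allocations trigger refills at allocations 1, 101, 201, 301 and 401, i.e. 5 tree accesses.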
|
I guess that makes sense. If you have fragmentation, not only are you traversing the free list for longer, you're also accessing fresh memory more often, which could reduce cache efficiency, which would dominate other performance considerations. |
runtime/freelist.c (outdated):

```c
  }
  Next_small (Val_hp (p)) = Val_NULL;
  /* Put the chain in the free list. */
  bf_small_fl[wosz].free = Next_small (Val_hp (big));
```
This assumes that bf_small_fl[wosz] is currently empty. However, the call to bf_allocate_from_tree above on line 1533 could have added a node to it. This leaks a node and results in an assertion failure with the debug runtime.
Good catch indeed!
Do you have a simple repro case? I have a fix but I'd like to test it.
Afraid not. I think it is causing a segfault in this benchmark about 1 time in 20. I'm just checking at the moment that this is indeed the cause.
|
A few answers before I go on vacation:
Our experience so far seems to show that all programs with large heaps will benefit to some extent. In normal cases, because the program fits in a noticeably smaller heap; in pathological cases because the compaction will trigger much less often.
The cost of maintaining and using the best-fit data structures doesn't seem to be higher than the simple next-fit free-list. But the space-time trade-off of the GC is actually controlled by the GC speed feedback loop, and when your heap is smaller (for example because of reduced fragmentation) it will spend more time in the GC. You can then recover your previous speed by increasing the space_overhead parameter of the GC.
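Concretely (a sketch using the standard `Gc` module; 200 is just an example value, not a recommendation from this thread), the space-time trade-off mentioned above can be relaxed from within the program:

```ocaml
(* Raise the major GC's space_overhead target: the GC will tolerate
   more wasted space before working harder, trading memory for speed.
   The default is 120 (percent); 200 here is an arbitrary example. *)
let () =
  Gc.set { (Gc.get ()) with Gc.space_overhead = 200 };
  Printf.printf "space_overhead = %d%%\n" (Gc.get ()).Gc.space_overhead
```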
Yes, I expect to switch the default to best-fit, as the overhead is (as far as we can tell) negligible.
We could partly alleviate this problem by pre-allocating a fixed number of words (instead of blocks) so the list of 16-word blocks would be much shorter.
I'm really wary of deviating too much from a strict best-fit policy because it's hard to tell what will happen in real programs. For example, Wilson et al. [1] observe that many programs have multiple phases during which they allocate some set of sizes, so it's expected that a program will allocate lots of 16-word blocks for some time, then suddenly switch to 2-word blocks. If you never split small blocks, you'll be stuck with a long list of 16-word blocks.
The small lists can be per-domain, and the splay tree will need locking. I'm worried about the contention on that lock. Maybe it can be reduced by splaying less often and using a single-writer/multiple-readers lock. If that doesn't work, we'll need to switch to a different data structure (and maybe a relaxed policy). |
Note to our future selves: there are approaches like "Lazy Splaying" which apparently reduce contention in concurrent cases whilst still preserving good performance on unbalanced workloads, so we might want to look at them if this does turn out to be an issue. |
Force-pushed eec86b9 to 01a4312: allocate directly from the smallest block of the tree. This removes the need for pre-loading the small free lists.
Force-pushed 01a4312 to 998cc89.
|
@lpw25 @stedolan This is the version with direct allocation in the smallest heap block. Also available for 4.08 at https://github.com/damiendoligez/ocaml/archive/4.08.1+best-fit.tar.gz |
|
We have a program where repeatedly executing some action led to memory growing significantly each time (to more than 1.5 GB) with next-fit but remaining reasonable (around 500 MB) with first-fit; unfortunately, runtime is almost tripled with first-fit. Is that the kind of slowdown expected with first-fit, and is there some hope that the new policy would bring the best of both worlds? (I suspect that the fragmentation is due to the use of demarshaling; I wonder whether demarshaling into small blocks wouldn't be better than the current strategy performance-wise in some cases: more GC-related overhead during demarshaling but less fragmentation...) |
|
Answering my own question, with the help of @nojb who cherry-picked the new policy onto our local version: the new policy looks terrific for our use case! Some results from repeatedly running a heavy calculation, monitoring the time taken by each run and the process memory usage after it, with the three policies:

The new policy is only a tiny bit slower than policy 0 (perhaps in the noise), and even better than policy 1 in terms of memory usage. With another payload, where policy 1 was only slightly slower than policy 0 but clearly better in terms of memory usage, I observe a similar conclusion: policy 2 is even better in terms of memory usage, and between 0 and 1 in terms of runtime. |
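For reference, a minimal sketch of this kind of measurement (the harness is my own illustration, not the one used in the report above): time each run with `Sys.time` and read the heap size from `Gc.quick_stat` afterwards.

```ocaml
(* Run [f] once, then report CPU time and current major-heap size.
   [Gc.quick_stat] avoids the full-heap traversal done by [Gc.stat]. *)
let measure name f =
  let t0 = Sys.time () in
  f ();
  let dt = Sys.time () -. t0 in
  let st = Gc.quick_stat () in
  Printf.printf "%s: %.2fs CPU, heap = %.1f MB\n"
    name dt
    (float_of_int (st.Gc.heap_words * (Sys.word_size / 8)) /. 1e6)
```

Usage might look like `measure "run 1" (fun () -> heavy_calculation ())`, where `heavy_calculation` is a stand-in for the payload.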
|
This looks great! I have a couple of notes, but most of them are about comments. |
|
I've been running the latest version with the debug runtime on all our tests and I'm getting an assertion failure:

```
file gc_ctrl.c; line 168
### Assertion failed: prev_hp == NULL || Color_hp (prev_hp) != Caml_blue || cur_hp == (header_t *) caml_gc_sweep_hp
```

which seems to indicate that there are fragments preceded by blue blocks on the heap, which I did not think was supposed to happen with the latest version. |
Co-Authored-By: Stephen Dolan <mu@netsoc.tcd.ie>
|
The org-mode slides of Damien's talk are now available, as well as the live notes I took during the talk. |
Good idea! Here's an alternative patch along those lines: |
|
I like this version: it seems more robust and probably has less run-time overhead. It relies on the property that |
A few hours discussing this with @jhjourdan yielded a way of doing that without re-splaying, for all the needed tree functions. It will be a rather large change for a better worst-case guarantee but probably no noticeable speed gain. We've decided to postpone it to 4.11. |
FTR: you can choose the initial policy with the OCAMLRUNPARAM environment variable. |
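For example, assuming the policy numbering used in this thread (0 = next-fit, 1 = first-fit, 2 = best-fit), the `a` letter of `OCAMLRUNPARAM` selects the policy at startup, and the same knob is exposed programmatically:

```ocaml
(* Equivalent to starting the program with OCAMLRUNPARAM='a=2':
   switch the major-heap allocation policy to best-fit (2). *)
let () =
  Gc.set { (Gc.get ()) with Gc.allocation_policy = 2 };
  assert ((Gc.get ()).Gc.allocation_policy = 2)
```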
|
Congratulations to everyone involved, this is a big, nice change. We have to make sure during the release process that heavy users know to try this new policy, and that they come back and report to us about improvements and, more importantly, any potential regressions.
Add a best-fit allocator, which helps a lot with fragmentation, according to some testing done at Jane Street on memory-hungry programs.
Notes for the reviewer:
The main change is in freelist.c, which now has three main sections, one for each allocation policy. The first two are the old allocator, split into two separate allocators. Reading these sections should be done in parallel with two copies of the old code (git only does the diff with the first, of course). The third is entirely new code.
I had to change the API of caml_fl_merge_block to avoid the cost of repeated coalescing in the best-fit case.