Statmemprof optimisation 1/3: optimise call stack collection#8731

Closed
stedolan wants to merge 3 commits into ocaml:trunk from stedolan:optimise-callstacks

Conversation

@stedolan
Contributor

This is the first of a series of 3 PRs that do some performance optimisation for statmemprof. I've been testing with the following short program, which is a bit of a worst-case example:

```ocaml
let samples = ref 0

let rec recur n f = match n with 0 -> f () | n -> recur (n-1) f + 1

let () =
  Gc.Memprof.start { sampling_rate = 0.01; callstack_size = 10;
                     callback = fun _ -> incr samples; None };

  for i = 1 to 10_000_000 do
    let d = recur 20 (fun () ->
      Array.make 20 0 |> ignore; ref 42 |> Sys.opaque_identity |> ignore; 0) in
    assert (d = 20)
  done;

  Printf.printf "%d\n" !samples
```

On trunk, this program runs about 2.2x slower with statmemprof enabled than with the Gc.Memprof.start call commented out. (The good news is that this is a fairly contrived program, doing nothing but allocation, and even on this program the overhead goes to zero with a sufficiently low sampling_rate.)

The overhead is roughly halved with these three patches, with this PR removing about 20% of it.

The major optimisation here, in the first commit, is to avoid traversing the stack twice when collecting a callstack. The previous implementation of callstack collection walked the stack once to work out the length of the callstack, then again to collect the actual data. This is a false economy: the extra copy introduced by first collecting the stack into a too-large buffer is cheaper than the second stack walk.

This simplifies some code in memprof.c: we no longer need to have separate codepaths for collecting callstacks on major and minor allocations, because the intermediate buffer is not under the control of the GC.

The second commit in this PR is a minor refactoring of caml_next_frame_descriptor, to make it a static function and to pass caml_frame_descriptors and caml_frame_descriptors_mask in local variables rather than reloading them from globals for each callstack entry. This is not because access to globals is especially slow: it is because previously, the C compiler was unable to prove that these globals were not modified during the loop, and so could not cache their values. Now it can cache these values in registers.

Contributor

@jhjourdan jhjourdan left a comment


Thanks for the proposal!

I am generally happy with the idea of the change except for the minor comments I added in the patch, and for the following more important concern:

In #8729, I reuse the same callstacks for several samples. This can give relatively important speedups when sampling unmarshalled data. It seems to me that, as is, this PR prevents doing this (well, one could copy callstacks, but the memory blocks would no longer be shared).

So, I still think this is a better idea to allocate the GC-managed block for the callstack in register_postponed_callback and store a managed pointer in the postponed queue instead of a caml_callstack value.

```c
                         occurrences, callstack, kind);

  if (Is_block(ephe)) caml_ephemeron_set_key(Field(ephe, 0), 0, block);
  if (postponed_tl == postponed_hd) purge_postponed_queue();
```
Contributor


An issue with this change is that purge_postponed_queue will not be executed if the callback raises an exception.

Contributor


Also, I tried to maintain the invariant that, when OCaml is executed, the queue is purged as soon as it is empty. This breaks the invariant, since the queue is purged after the callback is called.

```c
    trace_alloc *= 2;
    callstack = ret->elems.ptr =
      caml_stat_resize(callstack, sizeof(value) * trace_alloc);
    if (!callstack) { /* OOM */ ret->length = 0; return; }
```
Contributor


Don't we need to free callstack in this case?

Contributor


Also, I guess you want to use caml_stat_resize_noexc instead of caml_stat_resize.

```diff
-frame_descr * caml_next_frame_descriptor(uintnat * pc, char ** sp)
+static inline frame_descr * next_frame_descriptor(uintnat * pc, char ** sp,
+                                                  frame_descr** frame_descrs,
+                                                  uintnat frame_descrs_mask)
```
Contributor


Have you measured a significant speedup for this change on a micro-benchmark? I would have expected it to have a relatively small impact, since these globals are likely to be in L1 cache anyway.

At least, since this makes the code more intricate, some comments should be added to explain that these variables are cached versions of the corresponding global variables.

Contributor Author


Yes (I think it was around 5-10% of backtrace collection time? Can't remember what proportion of whole-program time that was).

I'll add the comment.

```c
    trace_alloc *= 2;
    callstack = ret->elems.ptr =
      caml_stat_resize(callstack, sizeof(value) * trace_alloc);
    if (!callstack) { ret->length = 0; return; } /* OOM */
```
Contributor


See the corresponding comments in backtrace_byt.c


```diff
-intnat caml_current_callstack_size(intnat max_frames);
-void caml_current_callstack_write(value trace);
+enum { Small_callstack_size = 32 };
```
Contributor


Why use an enum instead of a #define or a const uintnat here?

Contributor Author


Habit mostly? There are minor advantages to using an enum in C to define integer constants: it's a real symbol (so it's available in gdb, unlike #define), and it doesn't have linkage (so it works in header files, unlike const uintnat).

I think #define is more idiomatic for this codebase though, I'll change it.

Member


Another reason in favour of enum: unlike const uintnat it defines a genuine constant, so it can be used to give the dimension of the init array in the elems union below.

@stedolan
Contributor Author

Thanks for the comments! I'll find time to fix this patch up on Monday.

> In #8729, I reuse the same callstacks for several samples. This can give relatively important speedups when sampling unmarshalled data. It seems to me that, as is, this PR prevents doing this (well, one could copy callstacks, but the memory blocks would no longer be shared).
>
> So, I still think this is a better idea to allocate the GC-managed block for the callstack in register_postponed_callback and store a managed pointer in the postponed queue instead of a caml_callstack value.

How much is "relatively important"? Would you mind sending me a copy of whatever you're using to benchmark this?

@jhjourdan
Contributor

> How much is "relatively important"? Would you mind sending me a copy of whatever you're using to benchmark this?

Sorry, I don't have precise benchmark results available, and I lost the simple program I used to check. I won't have time to redo it until next Wednesday. But this was a rather simple benchmark: unmarshalling a rather long list (length 10000) of integers with a sampling rate of 0.01, and a no-op callback.

@stedolan
Contributor Author

Thanks for the review!

I'm going to put this PR on hold for the moment: the optimisation work can and should be done after the feature work, so I'll reopen this after the other statmemprof PRs are done.

@stedolan stedolan closed this Jul 29, 2019
@jhjourdan jhjourdan mentioned this pull request Sep 5, 2019
@stedolan
Contributor Author

stedolan commented Feb 3, 2020

Superseded by #9279

