Statmemprof optimisation 1/3: optimise call stack collection#8731

Closed
stedolan wants to merge 3 commits into ocaml:trunk from stedolan:optimise-callstacks

Conversation

@stedolan
Contributor

This is the first of a series of 3 PRs that do some performance optimisation for statmemprof. I've been testing with the following short program, which is a bit of a worst-case example:

```ocaml
let samples = ref 0

let rec recur n f = match n with 0 -> f () | n -> recur (n-1) f + 1

let () =
  Gc.Memprof.start { sampling_rate = 0.01; callstack_size = 10;
                     callback = fun _ -> incr samples; None };

  for i = 1 to 10_000_000 do
    let d = recur 20 (fun () ->
      Array.make 20 0 |> ignore; ref 42 |> Sys.opaque_identity |> ignore; 0) in
    assert (d = 20)
  done;

  Printf.printf "%d\n" !samples
```

On trunk, this program runs about 2.2x slower with statmemprof enabled than with the Gc.Memprof.start call commented out. (The good news is that this is a fairly contrived program, doing nothing but allocation, and even on this program the overhead goes to zero with a sufficiently low sampling_rate.)

The overhead is roughly halved with these three patches, with this PR removing about 20% of it.

The major optimisation here, in the first commit, is to avoid traversing the stack twice when collecting a callstack. The previous implementation of callstack collection walked the stack once to work out the length of the callstack, then again to collect the actual data. This is a false economy: the extra copy introduced by first collecting the stack into a too-large buffer is cheaper than the second stack walk.

This simplifies some code in memprof.c: we no longer need to have separate codepaths for collecting callstacks on major and minor allocations, because the intermediate buffer is not under the control of the GC.

The second commit in this PR is a minor refactoring of caml_next_frame_descriptor, to make it a static function and to pass caml_frame_descriptors and caml_frame_descriptors_mask in local variables rather than reloading them from globals for each callstack entry. This is not because access to globals is especially slow: it is because previously, the C compiler was unable to prove that these globals were not modified during the loop, and so could not cache their values. Now it can cache these values in registers.

Contributor

@jhjourdan jhjourdan left a comment


Thanks for the proposal!

I am generally happy with the idea of the change except for the minor comments I added in the patch, and for the following more important concern:

In #8729, I reuse the same callstacks for several samples. This can give relatively important speedups when sampling unmarshalled data. It seems to me that, as is, this PR prevents doing this (well, one could copy callstacks, but the memory blocks would no longer be shared).

So, I still think this is a better idea to allocate the GC-managed block for the callstack in register_postponed_callback and store a managed pointer in the postponed queue instead of a caml_callstack value.

```c
                         occurrences, callstack, kind);

  if (Is_block(ephe)) caml_ephemeron_set_key(Field(ephe, 0), 0, block);
  if (postponed_tl == postponed_hd) purge_postponed_queue();
```
Contributor


An issue with this change is that purge_postponed_queue will not be executed if the callback raises an exception.

Contributor


Also, I tried to maintain the invariant that, when OCaml is executed, the queue is purged as soon as it is empty. This breaks the invariant, since the queue is purged after the callback is called.

```c
    trace_alloc *= 2;
    callstack = ret->elems.ptr =
      caml_stat_resize(callstack, sizeof(value) * trace_alloc);
    if (!callstack) { /* OOM */ ret->length = 0; return; }
```
Contributor


Don't we need to free callstack in this case?

Contributor


Also, I guess you want to use caml_stat_resize_noexc instead of caml_stat_resize.

```diff
-frame_descr * caml_next_frame_descriptor(uintnat * pc, char ** sp)
+static inline frame_descr * next_frame_descriptor(uintnat * pc, char ** sp,
+                                                  frame_descr** frame_descrs,
+                                                  uintnat frame_descrs_mask)
```
Contributor


Have you measured a significant speedup for this change on a micro-benchmark? I would have expected it to have a relatively small impact, since these globals are likely to be in L1 cache anyway.

At least, since this makes the code more intricate, some comments should be added to explain that these variables are cached versions of the corresponding global variables.

Contributor Author


Yes (I think it was around 5-10% of backtrace collection time? Can't remember what proportion of whole-program time that was).

I'll add the comment.

```c
    trace_alloc *= 2;
    callstack = ret->elems.ptr =
      caml_stat_resize(callstack, sizeof(value) * trace_alloc);
    if (!callstack) { ret->length = 0; return; } /* OOM */
```
Contributor


See the corresponding comments in backtrace_byt.c


```diff
-intnat caml_current_callstack_size(intnat max_frames);
-void caml_current_callstack_write(value trace);
+enum { Small_callstack_size = 32 };
```
Contributor


Why use an enum instead of a #define or a const uintnat here?

Contributor Author


Habit mostly? There are minor advantages to using an enum in C to define integer constants: it's a real symbol (so it's available in gdb, unlike #define), and it doesn't have linkage (so it works in header files, unlike const uintnat).

I think #define is more idiomatic for this codebase though, I'll change it.

Member


Another reason in favour of enum: unlike const uintnat it defines a genuine constant, so it can be used to give the dimension of the init array in the elems union below.

@stedolan
Contributor Author

Thanks for the comments! I'll find time to fix this patch up on Monday.

> In #8729, I reuse the same callstacks for several samples. This can give relatively important speedups when sampling unmarshalled data. It seems to me that, as is, this PR prevents doing this (well, one could copy callstacks, but the memory blocks would no longer be shared).
>
> So, I still think this is a better idea to allocate the GC-managed block for the callstack in register_postponed_callback and store a managed pointer in the postponed queue instead of a caml_callstack value.

How much is "relatively important"? Would you mind sending me a copy of whatever you're using to benchmark this?

@jhjourdan
Contributor

> How much is "relatively important"? Would you mind sending me a copy of whatever you're using to benchmark this?

Sorry, I don't have precise benchmark results available, and I lost the simple program I used to check. I won't have time to redo it until next Wednesday. But this was a rather simple benchmark: unmarshalling a rather long list (length 10000) of integers with a sampling rate of 0.01, and a no-op callback.

@stedolan
Contributor Author

Thanks for the review!

I'm going to put this PR on hold for the moment: the optimisation work can and should be done after the feature work, so I'll reopen this after the other statmemprof PRs are done.

@stedolan stedolan closed this Jul 29, 2019
@jhjourdan jhjourdan mentioned this pull request Sep 5, 2019
@stedolan
Contributor Author

stedolan commented Feb 3, 2020

Superseded by #9279

