Statmemprof optimisation 1/3: optimise call stack collection #8731
stedolan wants to merge 3 commits into ocaml:trunk
Conversation
jhjourdan left a comment
Thanks for the proposal!
I am generally happy with the idea of the change except for the minor comments I added in the patch, and for the following more important concern:
In #8729, I reuse the same callstacks for several samples. This can give relatively important speedups when sampling unmarshalled data. It seems to me that, as is, this PR prevents doing this (well, one could copy callstacks, but the memory blocks would no longer be shared).
So, I still think it is a better idea to allocate the GC-managed block for the callstack in register_postponed_callback and to store a managed pointer in the postponed queue instead of a caml_callstack value.
```c
    occurrences, callstack, kind);

  if (Is_block(ephe)) caml_ephemeron_set_key(Field(ephe, 0), 0, block);
  if (postponed_tl == postponed_hd) purge_postponed_queue();
```
An issue with this change is that purge_postponed_queue will not be executed if the callback raises an exception.
Also, I tried to maintain the invariant that, when OCaml is executed, the queue is purged as soon as it is empty. This breaks the invariant, since the queue is purged after the callback is called.
```c
  trace_alloc *= 2;
  callstack = ret->elems.ptr =
    caml_stat_resize(callstack, sizeof(value) * trace_alloc);
  if (!callstack) { /* OOM */ ret->length = 0; return; }
```
Don't we need to free callstack in this case?
Also, I guess you want to use caml_stat_resize_noexc instead of caml_stat_resize.
```diff
-frame_descr * caml_next_frame_descriptor(uintnat * pc, char ** sp)
+static inline frame_descr * next_frame_descriptor(uintnat * pc, char ** sp,
+                                                  frame_descr** frame_descrs,
+                                                  uintnat frame_descrs_mask)
```
Have you measured a significant speedup for this change on a micro-benchmark? I would have expected it to have a relatively small impact, since these globals are likely to be in L1 cache anyway.
At least, since this makes the code more intricate, some comments should be added to explain that these variables are cached versions of the corresponding global variables.
Yes (I think it was around 5-10% of backtrace collection time? Can't remember what proportion of whole-program time that was).
I'll add the comment.
```c
  trace_alloc *= 2;
  callstack = ret->elems.ptr =
    caml_stat_resize(callstack, sizeof(value) * trace_alloc);
  if (!callstack) { ret->length = 0; return; } /* OOM */
```
See the corresponding comments in backtrace_byt.c
```c
intnat caml_current_callstack_size(intnat max_frames);
void caml_current_callstack_write(value trace);
enum { Small_callstack_size = 32 };
```
Why use an enum instead of either a #define or a const uintnat here?
Habit mostly? There are minor advantages for using an enum in C to define integer constants: it's a real symbol (so it's available in gdb, unlike #define), and it doesn't have linkage (so it works in header files, unlike const uintnat).
I think #define is more idiomatic for this codebase though, I'll change it.
Another reason in favour of enum: unlike const uintnat it defines a genuine constant, so it can be used to give the dimension of the init array in the elems union below.
Thanks for the comments! I'll find time to fix this patch up on Monday.
How much is "relatively important"? Would you mind sending me a copy of whatever you're using to benchmark this?
Sorry, I don't have precise benchmark results available, and I lost the simple program I used to check. I won't have time to redo it until next Wednesday. But it was a rather simple benchmark: unmarshalling a fairly long list (length 10000) of integers with a sampling rate of 0.01 and a no-op callback.
Thanks for the review! I'm going to put this PR on hold for the moment: the optimisation work can and should be done after the feature work, so I'll reopen this after the other statmemprof PRs are done.
Superseded by #9279
This is the first of a series of 3 PRs that do some performance optimisation for statmemprof. I've been testing with the following short program, which is a bit of a worst-case example:
On trunk, this program gets about 2.2x slower with statmemprof, compared to the execution time with the `Gc.Memprof` line commented out. (The good news is that this is a fairly contrived program, doing nothing but allocation, and even on this program the overhead goes to zero with a sufficiently low parameter.) The overhead is roughly halved by these three patches, with this PR removing about 20% of it.
The major optimisation here, in the first commit, is to avoid traversing the stack twice when collecting a callstack. The previous implementation of callstack collection walked the stack once to work out the length of the callstack, then again to collect the actual data. This is a false economy: the extra copy introduced by first collecting the stack into a too-large buffer is cheaper than the second stack walk.
This simplifies some code in `memprof.c`: we no longer need separate codepaths for collecting callstacks on major and minor allocations, because the intermediate buffer is not under the control of the GC.

The second commit in this PR is a minor refactoring of `caml_next_frame_descriptor`, making it a `static` function and passing `caml_frame_descriptors` and `caml_frame_descriptors_mask` in local variables rather than reloading them from globals for each callstack entry. This is not because access to globals is especially slow: it is because previously, the C compiler was unable to prove that these globals were not modified during the loop, and so could not cache their values. Now it can cache these values in registers.