Avoid scanning the stack twice when collecting callstacks in Memprof. #9279
gasche merged 2 commits into ocaml:trunk
Conversation
jhjourdan
left a comment
I approve this PR, which seems correct to me. I have a few minor nitpicks, which should not block merging.
runtime/backtrace_byt.c (outdated)

```c
intnat trace_pos = 0, trace_len = *plen;
value* trace = *pbuffer;
```
Caching trace_len and trace in locals makes the code harder to read. Have you actually observed a performance benefit from doing so?
The locals are there for correctness in the reallocation case:
```c
trace_len *= 2;
trace = caml_stat_resize_noexc(trace, trace_len * sizeof(value));
if (trace == NULL) break;
*pbuffer = trace;
*plen = trace_len;
```
If caml_stat_resize_noexc fails, the size has not changed and *plen should not be updated.
Right. But then I would have only a local at the allocation site.
I think it's cleaner to just introduce the locals where needed, I'll push a new version.
runtime/backtrace_nat.c (outdated)

```c
intnat trace_pos = 0, trace_len = *plen;
value* trace = *pbuffer;
```

```c
for (; trace_pos < trace_size; trace_pos++) {
while (trace_pos < max_frames) {
```
I think I prefer the while here!
Yes, this is still needed. The reason is that
Well, actually the comment about
force-pushed from cdf5c0a to 20f6a17
@jhjourdan does the last commit fix these?
jhjourdan
left a comment
Looks good to me. Ready to merge.
gasche
left a comment
Approved on Jacques-Henri's behalf.
Avoid scanning the stack twice when collecting callstacks in Memprof. (cherry picked from commit edee8ce)
The logic for collecting callstacks for Memprof is split into two passes: first the stack is scanned to work out its length and a buffer of that size is allocated; then the stack is scanned again to fill it.
This is a de-optimisation: scanning the stack is much more expensive than reallocating and copying a buffer that turned out to be too short.
This patch does only one stack scan, using an off-heap buffer that is resized if necessary, and copying the result to a correctly-sized buffer afterwards. It makes a simple benchmark that spends ~30% of its time in Memprof go ~10% faster, so this cuts off close to a third of the Memprof overhead. (Numbers will vary wildly according to stack depth and sampling rate.)
There's a further optimisation: memprof.c also caches the last used buffer to save a couple of allocations, as the stack size at the last sample is a good predictor of the stack size at the next.
@jhjourdan: Is the distinction between capture_callstack_postponed and capture_callstack still needed, now that caml_alloc doesn't invoke async callbacks?