
Avoid scanning the stack twice when collecting callstacks in Memprof.#9279

Merged
gasche merged 2 commits into ocaml:trunk from stedolan:optimise-callstacks-again
Feb 5, 2020

Conversation

@stedolan
Contributor

@stedolan stedolan commented Feb 3, 2020

The logic for collecting callstacks for Memprof is split into two passes: first, the stack is scanned to work out its length, a buffer is allocated, and then the stack is scanned again to fill it.

This is a de-optimisation: scanning the stack is much more expensive than reallocating and copying a buffer that turned out to be too short.

This patch does only one stack scan, using an off-heap buffer that's resized if necessary, and copying the result to a correctly-sized buffer afterwards. It makes a simple benchmark that spends ~30% of its time in Memprof go ~10% faster, so this cuts close to a third off the Memprof overhead. (Numbers will vary wildly according to stack depth and sampling rate.)

There's a further optimisation: memprof.c also caches the last used buffer to save a couple of allocations, as the stack size at the last sample is a good predictor of the stack size at the next.
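The scheme described above can be illustrated with a minimal standalone sketch in C. This is not the actual memprof.c code: the names (`capture`, `frame_t`, `cached_buf`) and the array-based "stack walker" are made up for illustration. It shows the single scan, the doubling of an off-heap buffer when it fills (leaving the cached buffer and length untouched if the resize fails), and the reuse of the cached buffer across calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef long frame_t;

/* Cached buffer: the stack depth at the last sample is a good
 * predictor of the depth at the next, so keep the buffer around. */
static frame_t *cached_buf = NULL;
static size_t cached_len = 0;

/* Simulated "stack walker": here the stack is just an array.
 * Returns the number of frames captured; *out receives a
 * correctly-sized heap copy of the captured frames. */
static size_t capture(const frame_t *stack, size_t depth,
                      frame_t **out, size_t max_frames)
{
    if (cached_buf == NULL) {
        cached_len = 16;                      /* initial size guess */
        cached_buf = malloc(cached_len * sizeof(frame_t));
    }
    size_t pos = 0;
    while (pos < depth && pos < max_frames) { /* single scan */
        if (pos == cached_len) {              /* buffer full: double it */
            size_t new_len = cached_len * 2;
            frame_t *p = realloc(cached_buf, new_len * sizeof(frame_t));
            if (p == NULL) break;             /* on failure, keep old buffer/length */
            cached_buf = p;
            cached_len = new_len;
        }
        cached_buf[pos] = stack[pos];
        pos++;
    }
    /* Copy to a correctly-sized result; the cached buffer is kept for reuse. */
    *out = malloc(pos * sizeof(frame_t));
    memcpy(*out, cached_buf, pos * sizeof(frame_t));
    return pos;
}
```

The point of the doubling-plus-copy strategy is that the (expensive) stack walk happens exactly once per sample, while the occasional realloc and final memcpy touch only the (cheap) off-heap buffer.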

@jhjourdan: Is the distinction between capture_callstack_postponed and capture_callstack still needed, now that caml_alloc doesn't invoke async callbacks?

Contributor

@jhjourdan jhjourdan left a comment

I approve this PR, which seems correct to me. I have a few minor nitpicks, which should not block merging.

Comment on lines +284 to +285
intnat trace_pos = 0, trace_len = *plen;
value* trace = *pbuffer;
Contributor

Caching trace_len and trace makes the code harder to read. Have you actually observed some performance benefit doing so?

Contributor Author

The locals are there for correctness in the reallocation case:

      trace_len *= 2;
      trace = caml_stat_resize_noexc(trace, trace_len * sizeof(value));
      if (trace == NULL) break; /* resize failed: old buffer and *plen stay valid */
      *pbuffer = trace;
      *plen = trace_len;

If caml_stat_resize_noexc fails, the size has not changed and *plen should not be updated.

Contributor

Right. But then I would have only a local at the allocation site.

Contributor Author

I think it's cleaner to just introduce the locals where needed, I'll push a new version.

Comment on lines +121 to +122
intnat trace_pos = 0, trace_len = *plen;
value* trace = *pbuffer;
Contributor

Cf. above.

}

for (; trace_pos < trace_size; trace_pos++) {
while (trace_pos < max_frames) {
Contributor

Ditto.

Contributor Author

I think I prefer the while here!

@jhjourdan
Contributor

@jhjourdan: Is the distinction between capture_callstack_postponed and capture_callstack still needed, now that caml_alloc doesn't invoke async callbacks?

Yes, this is still needed. The reason is that caml_alloc_shr gives the additional guarantee that no memory is moved in the OCaml heap, and there is indeed code that calls it without declaring all the roots. Therefore, it is not possible to allocate in the minor heap in capture_callstack_postponed, since this function is called by caml_alloc_shr.

@jhjourdan
Contributor

I have a few minor nitpicks, which should not block merging.

Well, actually the comment about caml_shutdown is serious and should be addressed since it can lead to UB.

@stedolan stedolan force-pushed the optimise-callstacks-again branch from cdf5c0a to 20f6a17 on February 4, 2020 11:03
@stedolan
Contributor Author

stedolan commented Feb 5, 2020

I have a few minor nitpicks, which should not block merging.

Well, actually the comment about caml_shutdown is serious and should be addressed since it can lead to UB.

@jhjourdan does the last commit fix these?

Contributor

@jhjourdan jhjourdan left a comment

Looks good to me. Ready to merge.

Member

@gasche gasche left a comment

Approved on Jacques-Henri's behalf.

@gasche gasche merged commit edee8ce into ocaml:trunk Feb 5, 2020
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 6, 2020
Avoid scanning the stack twice when collecting callstacks in Memprof.

(cherry picked from commit edee8ce)
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 17, 2020
…edee8ce)

Avoid scanning the stack twice when collecting callstacks in Memprof.
mshinwell pushed a commit to mshinwell/ocaml that referenced this pull request Apr 7, 2020
…edee8ce)

Avoid scanning the stack twice when collecting callstacks in Memprof.

3 participants