Conversation
We will put them back if profiling shows it is really necessary.
Could you say a few words about how this work compares with Spacetime? When would one use one or the other? What are the advantages/disadvantages of one over the other?
I won't comment more on this PR, as I am biased towards OCamlPro's Memory Profiler (ocp-memprof), but since you compared the two memory profilers in your talk, I will give my own comparison here:
FWIW, I am sad to see so much work (this work and Spacetime) going into building competitors to ocp-memprof, whose license was really cheap and provided full access to the sources, and which OCamlPro now has little incentive to keep working on. Sometimes, it's better to pay a little to get a well-crafted tool with an efficient GUI than a plethora of unmaintained prototypes.
@lefessan I think many people would be pleased if OCamlPro would consider contributing the code for storing a compressed version of the heap graph (as applied to Spacetime snapshots). Although I understand this might not be possible for commercial reasons.
It's not a matter of commercial reasons; it's just that we have no good reason to spend any more time (and thus, money) on this subject. And actually, technically, releasing this code would be useless without releasing the code to analyse them, i.e., all of ocp-memprof.
From a technical point of view, this is very different from Spacetime: Spacetime not only profiles allocations, it also constructs a call graph that can be used for debugging as well as for profiling purposes. This approach, on the other hand, chooses only a few allocations at random, and gives the user the opportunity to gather a lot of information about the state of the program when allocating. This has two major advantages: first, it is much more lightweight, making it possible to enable it in production with almost no overhead. Second, because any user-chosen function can be called when sampling, it gives more flexibility as to which information is gathered.
This is true, but a little exaggerated: first, the memory overhead is, in fact, negligible if the sampling rate is low enough, and second, the memory is not "leaked", because it is recovered as soon as the corresponding heap block is freed.
When configured with --statmemprof, there is still a use of VLAs (variable-length arrays) in memprof.c, making the code non-C90-compliant and hence incompatible with MSVC.
Do you have numbers? For example on Coq, where backtraces might become large with some tactics?
What I meant was: if the program is leaking memory, it will leak even more memory, since these leaked blocks are never freed, and neither are the attached backtraces.
Let us consider an extremely bad case: the captured callstacks have average length 10^4. Then, if we set the sampling rate to 10^-5 (with which you already get plenty of statistical information), the memory overhead is bounded by 10%. Given that, in practice, I expect callstacks to be much smaller (an average length of 10^4 would mean that the program is quite close to stack overflow, in which case the programmer probably has other problems on her mind), I really think the memory overhead is negligible. If you are still worried about your memory being filled with callstacks, you can either compress them when sampling or limit their size.
I experimented a bit with statmemprof. Here are some observations:
Things that might be interesting to know / do.
I am very enthusiastic about this work, and would love to see it reviewed and merged. I don't think the integration with the Spacetime viewer is a prerequisite.
Nice! Thanks a lot @braibant. Two quick remarks on low-hanging fruit:
At the Mirage retreat @hannesm played with statmemprof and had the same issue (bigarrays are heavily used in the Mirage codebase due to Cstruct), and @chambart helped him do a sort of ad-hoc hack to track bigarray allocations (adding instrumentation directly in the C implementation of bigarrays) -- if I understand correctly, in a separate ephemeron. It would be nicer to have support for out-of-heap resources (and why not also the "virtual cost" API of custom values) built into statmemprof. This might emerge as an enhancement PR from @hannesm / @chambart's experiment, but I would not bet on it -- someone with the resources to do something complete for upstreaming may go quicker than this pair of already-overloaded programmers. (For Zarith, I would assume that the out-of-heap size is relatively small, and thus fairly proportional to the OCaml-side tracking? Do bignums get arbitrarily large in real-world computations, the way bigarrays do?)
I know that @let-def has done some amusing experiments encoding type information in "the rest of the header" (what is typically used by ocp-memprof or Spacetime); it might be possible to combine the two lines of work in interesting ways!
@jhjourdan In terms of getting this upstreamed, I think we need to break this down into its constituent parts, much like I'm doing for the gdb work. I believe there is sufficient consensus as to the overall aim here that we should merge individual parts of the work as they become ready, even if some parts have a few ragged edges and stubs, rather than doing everything strictly in dependency order. My previous experience has shown that the latter often leads to long delays. As discussed on caml-devel, JS are willing to devote resources to reviewing this from approximately the start of April onwards, and @let-def has kindly agreed to contribute some of his time as well. We can do the splitting up of the patch at that time if necessary. The rough idea so far is to concentrate on getting the core parts in for 4.09 and defer some of the more elaborate pieces for 4.10. There are two elaborate pieces that I have thought about so far:
@lpw25 and I have discussed what to do about the problem relating to backtraces that @braibant mentions. We think we have an approach that should be straightforward to implement without much code: add some functionality to retrieve the backtrace as a list of return addresses. These can then be put through the same mechanisms (which, if I remember correctly, use @let-def's
@braibant Thanks for the feedback!
I can indeed plan to support out-of-heap resources as an improvement of the current implementation. I do not think this is a difficult addition. It will require, however, some support from the corresponding C libraries. This could be as simple as requiring them to use
AFAIK, @let-def was already doing something in that direction.
I would say that this is something that can be done outside of the OCaml runtime, e.g., as part of statmemprof-emacs or similar tooling. It is easy to imagine that the sampling callback inspects, e.g., the current time and the current status of the GC (via Gc.quick_stat) to record the date of birth of each sampled block.
There is already such a mechanism in this merge request, and this deferral mechanism is the source of much of the complication, since it requires a specific data structure for recording the deferred allocations. The handling of non-deferred allocations is actually rather simple. The only simplification of your proposal (deferring all C allocations) would be in the public interface of
I agree that Comballoc adds a large amount of complexity to the code. Perhaps a solution for a first version would be to have statmemprof activated only with a specific configure option, and this configure option would deactivate Comballoc? I don't have a clear understanding of the performance impact of deactivating Comballoc, though. @xavierleroy?
Alright. Then I'll try to prepare smaller patches to review when I have a bit more time than now.
@jhjourdan OK, I'm going to have to look more carefully at the details for the
@chambart presented statmemprof at the MirageOS retreat in Marrakesh last week. He used a slightly modified statmemprof-emacs which includes three more numbers (roughly, GC generation numbers): the generation of the first allocation, the generation of the last allocation, and the average generation. If the first is 0, the last is the maximum, and the average is above the midpoint, the allocation may be a leak (these are worth looking for) -- I find these additions incredibly helpful.
Yes, this would be great to have integrated. I'm not sure whether it should be statistical as well, or take into account every allocation. Some initial places where code needs to be hooked were identified (esp. for bigarray); I will see whether I can develop this further.
I adapted the -> Could the
Converting a
A substantially updated version of this work has now been merged.
Thanks, @lpw25!
This GPR implements a mechanism to statistically profile the heap used by an OCaml program. The sampling rate is tunable so that the overhead can be reduced as much as desired. More information about the general idea can be found in this document.
The patch can be divided into several parts:
This GPR is still a work in progress. I would be happy to receive general comments from the OCaml development team. I also have other concerns:
The interface of the caml_alloc_shr function and its variants becomes a mess. This patch introduces a variant of caml_alloc_shr allowing GC calls, which is called as often as possible. So here is my question: apart from the OCaml runtime system, do you think there is much code in the wild exploiting the fact that caml_alloc_shr does not call the GC? Would it be possible to remove this guarantee?