@NickBarnes (Contributor) commented Jan 19, 2024

This PR resurrects the statistical memory profiler "statmemprof", which was present in OCaml 4.14 but removed as part of the multicore merge.

This PR description covers a lot of statmemprof design and implementation points, including those implemented in earlier PRs, to bring everything together in one place.

Overview

Statmemprof allows sampling of all memory allocation with some probability ("lambda") per word, without recompilation or invasive instrumentation. Callbacks are made into Caml for every sampled block, at allocation, promotion, and de-allocation. Allocation callbacks optionally take the stack backtrace as argument. There is a tradeoff between sampling frequency and overhead, and the overhead is very low for lambda below 10⁻².

There are semantic and implementation considerations for statmemprof on multicore OCaml. These are resolved here, as described below, following conversation on #11911, #12379, #12381, and elsewhere. The choices made here allow further development of the statmemprof API, which is outside the scope of this PR.

Semantics

Semantics are as described in #12381, and are restated here for completeness.

Multicore OCaml introduces domains, which are deeply connected to memory allocation and garbage collection. So it is natural for allocation sampling to be controlled on a per-domain basis. In contrast, it would be difficult to control sampling synchronously for all domains, or to control sampling on a per-thread or per-fiber basis, so that is not attempted in this PR. Rather: all code executing in a single domain shares the same sampling behaviour, whereas code running in distinct domains may sample differently (with different frequencies, callbacks, and callstack depth). This is the natural extension of OCaml 4 statmemprof, in which, similarly, distinct threads and fibers are all profiled together.

Like OCaml 4 statmemprof:

  • Statmemprof callbacks have the same types as in OCaml 4;
  • Allocation sampling is not per-thread or per-fiber;
  • Callbacks may be deferred to a "safe point", shortly after the profiled event (allocation, promotion, or deallocation);
  • Despite this deferral, the callstack passed to an allocation callback accurately reflects the stack at the moment of allocation;
  • If an allocation or promotion callback raises an exception or returns None, statmemprof will stop tracking that particular block;
  • No allocation caused by a statmemprof callback is sampled;
  • An allocation callback is always run by the thread which allocated the block;
  • Promotion and de-allocation callbacks are not constrained to be run by the allocating thread.

Unlike OCaml 4 statmemprof:

  • There is a type Gc.Memprof.t, representing "a profile". This allows profiling tools to meaningfully distinguish and control multiple profiles within a single program;
  • Long-lived programs are able to profile memory allocated during a single phase of program execution, including promotions and de-allocations which happen after the phase has concluded, by changing the behaviour of Gc.Memprof.stop and introducing Gc.Memprof.discard (see the sketch after this list):
    • Gc.Memprof.stop() stops sampling, but does not stop promotion and de-allocation callbacks. These callbacks may be called after sampling has stopped, including while some other profile is sampling.
    • The new function Gc.Memprof.discard p discards the state of the profile p, so no more callbacks for it will be called.
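
As an illustration of the stop/discard separation, here is a minimal sketch of profiling a single phase of a long-running program. It assumes the API shape described above (Gc.Memprof.start returning a Gc.Memprof.t); the tracker, sampling rate, and function names are only examples, not part of this PR.

    (* Example tracker: count sampled words allocated on the minor heap. *)
    let samples = Atomic.make 0

    let tracker : (unit, unit) Gc.Memprof.tracker = {
      alloc_minor = (fun info ->
        ignore (Atomic.fetch_and_add samples info.Gc.Memprof.n_samples);
        Some ());
      alloc_major = (fun _ -> Some ());
      promote = (fun () -> Some ());
      dealloc_minor = (fun () -> ());
      dealloc_major = (fun () -> ());
    }

    (* Profile only the phase [f]. Sampling stops when the phase ends, but
       promotion and de-allocation callbacks for blocks sampled during the
       phase keep running until the profile is discarded. *)
    let profile_phase f =
      let profile = Gc.Memprof.start ~sampling_rate:1e-4 tracker in
      let result = Fun.protect ~finally:Gc.Memprof.stop f in
      (result, profile)

    (* Much later, when post-phase promotions and de-allocations are no
       longer of interest: *)
    let finish profile = Gc.Memprof.discard profile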

Multi-domain semantics, inapplicable to OCaml 4:

  • Sampling is per-domain. Gc.Memprof.start starts sampling for the current domain only;
  • If a domain is sampling when it calls Domain.spawn, the new domain also samples, with the same profile;
  • If any sampling domain calls Gc.Memprof.stop() then all domains stop sampling for that profile;
  • If any domain calls Gc.Memprof.discard p, then all domains stop running callbacks for profile p;
  • All callbacks are run by the domain which allocated the tracked block, if possible;
  • But: if the domain which allocated a block has terminated before it is promoted or de-allocated, that block's callbacks may be run (when appropriate) by another domain.
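
The following sketch illustrates the per-domain rules above (the sampling rate and worker function are placeholders, and the start/stop/discard signatures are assumed as in the earlier sketch): a profile started in the spawning domain is inherited by the spawned domains, and a single stop ends sampling for that profile everywhere.

    (* [work : int -> unit] is some hypothetical per-worker job. *)
    let run_workers work =
      (* Started in the parent domain; domains spawned below inherit this profile. *)
      let profile = Gc.Memprof.start ~sampling_rate:1e-4 Gc.Memprof.null_tracker in
      let workers = List.init 4 (fun i -> Domain.spawn (fun () -> work i)) in
      List.iter Domain.join workers;
      Gc.Memprof.stop ();         (* stops sampling for this profile in all domains *)
      Gc.Memprof.discard profile  (* no further callbacks for it, anywhere *)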

Future Work

The introduction of Gc.Memprof.t, and the separation of stop and discard, are suggestive of many future API extensions, such as the following, which are not treated in this PR but left for future work. The implementation has been made with this sort of extension in mind.

    val join      : t -> unit
    val current   : unit -> t option
    val leave     : unit -> unit
    val suspended : ('a -> 'b) -> 'a -> 'b
    val status    : t -> ???
    val stats     : t -> ???

Design

Much of the design is shared with OCaml 4 statmemprof, with global state made per-domain. All the code is freshly written, as various OCaml 4 assumptions are unsound in the multicore runtime, but some familiarity with the design of OCaml 4 statmemprof is recommended for review of this PR.

The rest of this design section is structured in a similar order to the commits on this PR. Much of the text here is also provided in a design comment in memprof.c.

Profile objects

The new Gc.Memprof.t profile value, called a "configuration" in the implementation, is a block allocated on the Caml heap with fields for the various configuration items, and a field for "status" which allows distinct domains to safely stop and discard (with atomic reads and writes).

Entries and Entry Tables

Each block of memory tracked by memprof is represented by an "entry" structure (entry_s, *entry_t). It tracks the state of the block of memory, and its progress through the various callbacks.

A resizable table of entry structures is called an "entries" table (entries_s, *entries_t). It tracks ranges of those entries which may (a) be ripe for running a callback, (b) be marked for deletion, or (c) contain pointers to the minor heap (to be scanned in a minor collection). As processing each of these actions proceeds linearly through the table, this tracking is done simply by keeping the lowest possible entry index for each purpose. The code to perform each action (running a callback, evicting a deleted entry, or scanning a pointer) checks whether an entry does require the action before performing it.
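
As a rough picture of this bookkeeping (OCaml rather than the runtime's C, with hypothetical field names), each kind of pending work keeps only a conservative lower bound into the table; processing scans forward from that bound and re-checks each entry before acting on it, so the bound never has to be exact:

    type entry = { mutable deleted : bool; mutable callback_due : bool }

    type entries = {
      mutable tbl : entry array;
      mutable len : int;
      (* No entry below [next_callback] has a callback due, and no entry
         below [next_deleted] is marked for deletion. *)
      mutable next_callback : int;
      mutable next_deleted : int;
    }

    let run_pending_callbacks t run_one =
      while t.next_callback < t.len do
        let e = t.tbl.(t.next_callback) in
        (* Re-check before acting: the index is only a lower bound. *)
        if e.callback_due && not e.deleted then run_one e;
        t.next_callback <- t.next_callback + 1
      done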

The entries table also has a pointer to the configuration object on the Caml heap, for the profile under which all the entries in the table were sampled. This allows callbacks on the table to be run at any later time, regardless of the currently-sampling profile of the particular domain running the callback. A consequence is that all entries in a table must be from the same profile.

After a profile is discarded, entries may still exist for blocks allocated in that profile, but no callbacks will be called for it (those entries themselves will be discarded lazily).

There is code for iterating over entries in a table, which is used when scanning for GC roots or updating tables to reflect GC activity (see below).

Thread State

The memprof state of a particular systhread is a "thread state" (memprof_thread_s, *memprof_thread_t), introduced in #12382. It has an entries table, for blocks allocated by this thread whose allocation callback has not yet completed. All allocation callbacks are guaranteed to be called by the thread performing the allocation (in the rare circumstance in which this is impossible, the tracking entry is discarded).

This thread state structure exists whether or not the systhreads module is initialized (if it is not, there is a single thread state per domain), and whether or not memprof is running.

Domain State

The memprof state of a domain is a "domain state" (memprof_domain_s, *memprof_domain_t), introduced in #12382. It has an entries table, for blocks allocated in this domain whose allocation callbacks have completed. If a domain terminates, or starts a new profile, while it still has tracked entries from a previous profile, those tracked entries become "orphaned" (see below).

The domain state has a linked list of thread states for all the threads in the domain, and a pointer to the current thread state.

This structure exists whether or not memprof is running. A pointer to it is kept in the caml_domain_state, accessed as dom_st->memprof.

Orphans

When sampling is stopped for a profile, all domains and threads continue to manage the entry tables for it as before, but without sampling and creating new entries. However, if a domain starts a new profile while it has entries (tracked blocks) from a previous profile which has not been "discarded", it moves those entries to its "orphans" list - a linked list of entry tables - for subsequent processing.

If a domain is terminated, all its current and orphaned entries (and those of its threads) are moved to a global orphans list. This list, and its protective lock orphans_lock, are the only memprof global variables. No domain processes the entries in the global orphans list directly: the first domain to look at the list (either at a collection or when checking for pending callbacks) adopts all entry tables on it into its own orphans list, and then processes them as its own.

Synchronisation

For the most part, threads and domains are free to run callbacks on their own allocated blocks without explicit synchronisation. Care is taken not to assume that the memprof state of any given thread or entry in a domain is preserved outside of memprof code, as another thread in the same domain may run and modify that state, but we assume that the systhreads module effectively serializes entry into memprof code within a single domain (for these purposes, entering and returning from a callback is treated as leaving and re-entering memprof code).

However, there are some structures shared between domains. The main such structure is the profile configuration object on the Caml heap. The only field written in this object is the status field, and this is used to communicate between domains sharing the profile, when a profile is stopped or discarded. This field is inspected or set atomically by the Status and Set_status macros. If a profile is found to be discarded (CONFIG_STATUS_DISCARDED) then no domain need take any action on it (and we can lazily discard any state from it).

The only other data shared between domains is the global orphans list. As noted above, this is protected by a single global lock, orphans_lock. Because an entry table only gets onto the global orphans list when its owning domain terminates (at which point all threads of that domain have terminated), and a table is adopted from the global orphans list before being processed, all callbacks and other entry table processing is performed by a thread of the domain which owns the entry table (and actions of those threads are serialized by systhreads).

Random Number Generation

We sample every word of allocation with the same probability (lambda, usually very small) - a Bernoulli trial. For the allocation of a block on the shared heap, or any allocation from the C runtime, we need to know how many samples we make of that block (usually zero). This is a binomial random variable, parameterized by lambda and N (the number of words in the block, including the header).

For allocations by Caml on the minor heap, we use the existing GC trigger mechanism, to cause Caml to enter the runtime when "the next sample" is due. The amount of allocation before "the next sample" is a geometric random variable, parameterized by lambda.

We focus on generating geometric pseudo-random numbers (PRNs), and simulate binomial PRNs for parameters (lambda, N) by counting geometric PRNs for lambda which sum to no more than N.

We use a high-quality high-performance 32-bit uniform PRNG (xoshiro128+), with per-domain state vectors. We initialize the per-domain state vector with a low-quality PRNG (SplitMix64), seeded separately for each domain.

To convert from a uniform PRN u to a geometric PRN g, we compute g = floor(1 + log(u) / log(1-lambda)) where we treat u as uniformly distributed in [0,1]. We pre-compute 1/log(1-lambda), and compute log(u) using a combination of type punning and a 3rd-degree polynomial.
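
For concreteness, the arithmetic looks like this (an OCaml illustration only; the runtime implements it in C with the precomputed 1/log(1-lambda), the fast polynomial log approximation, and block generation described here):

    (* Convert a uniform draw [u] in (0,1] into a geometric variate: the
       number of allocated words up to and including the next sample. *)
    let geometric ~lambda u =
      int_of_float (1. +. log u /. log (1. -. lambda))

    (* Simulate a binomial sample count for an [n]-word block by counting
       geometric variates which sum to no more than [n]; [draw] returns
       fresh uniform variates in (0,1]. *)
    let binomial_samples ~lambda ~n draw =
      let rec go count pos =
        let pos = pos + geometric ~lambda (draw ()) in
        if pos > n then count else go (count + 1) pos
      in
      go 0 0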

For further efficiency we generate geometric PRNs in blocks, and the generating code is designed to be vectorizable.

Interface with GC

Root scanning

Memprof may have a large number of strong GC roots: one per tracked block, pointing to the tracking information ('minor or 'major, in the Gc.Memprof.tracker sense), plus the pointer to a config block in every entries table. Rather than manually registering and deregistering all of these, the GC calls caml_memprof_scan_roots() to scan them, in either minor or major collections. This function is called by all domains in parallel. A single domain adopts any global orphaned entries tables, and then each domain scans its own roots.

Updating block status

After a major or minor GC, memprof has to check tracked blocks to discover whether they have survived the GC, or (for a minor GC) whether they have been promoted to the major heap. This is done by caml_memprof_after_minor_gc() and caml_memprof_after_major_gc(), which share the system for iterating over entries tables as used by caml_memprof_scan_roots(). Again, these functions are called by all domains in parallel; a single domain starts by adopting any global orphaned entries tables, and then each domain updates its own entries.

Compaction

GC compaction may move all objects in the major heap, so all memprof roots must be scanned and potentially updated, including the weak roots (i.e. pointers to the tracked blocks). This is done by the same caml_memprof_scan_roots() function as root scanning in regular GCs, using a boolean argument to indicate that weak roots should also be scanned.

Backtraces

We have to be able to sample the current backtrace at any allocation point, and pass it (as a Caml array) to the allocation callback. We assume that in most cases these backtraces have short lifetimes, so we don't want to allocate them on the shared heap. However, we can't always allocate them directly on the Caml minor heap, as some allocations (e.g. allocating in the shared heap from the runtime) may take place at points at which GC is not safe (and so minor-heap allocation is not permitted). In those cases we "stash" the backtrace on the C heap, and copy it onto the Caml heap when we are about to call the allocation callback.

Sampling

We sample allocation for all threads in a domain which has a currently sampling profile, except when such a thread is running a memprof callback, which "suspends" sampling on that thread.

Allocation sampling divides into two cases: one simple and one complex.

Simple Sampling

When sampling an allocation by the runtime (as opposed to allocation by Caml), an entry is added to the thread's entry table, for subsequent processing. No allocation callback is called at allocation time, because the heap may not be consistent so allocation by the callback is not safe (see "Backtraces"). So the callback is "deferred" (although the backtrace is taken at allocation time and allocated on the C heap).

Minor Heap Caml Allocation Sampling

Caml code allocates on the minor heap by pointer-bumping, and only drops into the runtime if the young_ptr allocation pointer hits the young_trigger, usually triggering a garbage collection. When profiling, we set the trigger at the next word which we want to sample (see "Random Number Generation"), thus allowing us to enter memprof code at the appropriate allocation point. However, sampling the allocation is more complex in this case for several reasons:

  • Deferred allocation. A sampled block is not actually allocated until the runtime returns to the GC poll point in Caml code, after the memprof sampling code has run. So we have to predict the address of the sampled block for the entry record, to track its future promotion or collection. Until the allocation callback has run, instead of the allocated block address, the entry holds the offset in words of the block within the combined allocation, and the entry's offset field is set.

  • Combined allocations. A single GC poll point in Caml code may combine the allocation of several distinct blocks, each of which may be sampled independently. We create an entry for each sampled block and then run all allocation callbacks.

  • Prompt allocation callbacks. We call allocation callbacks directly from memprof as we sample the allocated blocks. These callbacks could be deferred (as are the ones in the "Simple Sampling" case), but that would require twice as many entries into memprof code. So the allocation callback is called before the sampled block is actually allocated (see above), and several allocation callbacks may be called at any given GC poll point (due to combined allocations). We take care to arrange heap metadata such that it is safe to run allocation callbacks (which may allocate and trigger minor and major GCs).

  • Other callbacks. In order to call the allocation callbacks from the poll point, we process the thread's entries table. This may call other callbacks for the same thread (specifically: deferred "Simple Sampling" callbacks).

  • Callback effects. Any callback may raise an exception, stop sampling, start a new profile, and/or discard a profile.

    • If a callback raises an exception, none of the allocations from the current poll point will take place. However, some allocation callbacks may already have been called. If so, we mark those entries as "deallocated", so that matching deallocation callbacks will run. We simply delete any tracking entry from the current poll point which has not yet run an allocation callback. Then we propagate the exception up to Caml.
    • If a callback stops sampling, subsequent allocations from the current poll point will not be sampled.
    • If a callback stops sampling and starts a new profile, none of the allocations from the current poll point are subsequently tracked (through promotion and/or deallocation), as it's not possible to reconstruct the allocation addresses of the tracking entries, so they are simply deleted (or marked as deallocated, as in the exceptional case). The new profile effectively begins with the following poll point or other allocation.

Most of this complexity is managed in caml_memprof_track_young().

Callbacks

Some callbacks are run at allocation time, for allocations from Caml (see under "Sampling" above). Other allocation callbacks, and all post-allocation callbacks, are run during caml_memprof_run_callbacks_exn(), which is called by the runtime's general pending-action mechanism at poll points.

We set the domain's action-pending flag when we notice we have pending callbacks. Caml drops into the runtime at a poll point, and calls caml_memprof_run_callbacks_exn(), whenever the action-pending flag is set, whether or not memprof set it. So memprof maintains its own per-domain pending flag, to avoid suspending/unsuspending sampling, and checking all the entries tables, when there are no pending callbacks.

This is particularly important because when we unsuspend sampling, we reset the young-limit, which has the side-effect of setting the domain's action-pending flag.

Allocation callbacks are always run by the thread which made the allocation, unless that thread terminates before running the callback, in which case it is inherited by the domain.

Callbacks are run by iterating through candidate entries in an entries table. See under "Entries and Entry Tables" above. A single entry may have more than one callback to run (if, for example, it has been promoted and garbage collected since the last time callbacks for that entry were run) - they are run in the natural order.

History

Statmemprof was originally implemented for OCaml 4 by @jhjourdan and @stedolan in 2019/2020, and is used in various tools such as memtrace. The interactions with multicore OCaml were complex enough that it was decided to remove statmemprof when merging multicore into the trunk for OCaml 5.

This PR builds on the preparatory work in #12381, #12382, and #12383. It is a radical rework of the draft PR #12379, and subsumes the draft PR #12390.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from fbe504c to a9f615f on January 19, 2024 17:31
@stedolan self-assigned this on Jan 22, 2024
@stedolan (Contributor) commented

(I'm reviewing this at the moment)

@jhjourdan (Contributor) commented

I'm planning to have a look during the week, too.

@OlivierNicole (Contributor) commented

Would it make sense to add the run-thread-sanitizer label on this PR to run TSan on the updated testsuite?

@gasche added the run-thread-sanitizer label ("Makes the CI run the testsuite with TSAN enabled") on Jan 22, 2024
@gasche (Member) commented Jan 22, 2024

I did as you suggested to give the new label a try. (I had to create it first and pick a random color, I hope you like it.)

@gasche (Member) commented Jan 22, 2024

On the PRNG part, I could see users willing to fix the Memprof seed for better reproducibility. The current approach where each domain is seeded independently from a heap address makes this hard. (It also makes it hard to reason about the independence of the initial states.) Could we instead have a global PRNG state that is used only by caml_memprof_new_domain, and takes a lock? Then it is easy to add some compiler user interface to fix its seed if we want.

Note: the existing code already adds both Xoshiro (for the sampling, for performance reasons) and SplitMix (to seed the Xoshiro state). But now, unlike when Memprof was first implemented, we already have a PRNG in the C runtime, namely LXM in prng.c. Maybe you could use this instead of SplitMix for seeding, to reduce the total number of PRNG implementations?

@gasche (Member) commented Jan 22, 2024

(Thanks for the effort writing the PR description and the comments in the code, this is great. I may not have the time to look at it soon and am not a competent reviewer, but I'm very happy about the documentation outlook.)

@NickBarnes (Contributor, Author) commented

We could absolutely use L64X128 (another Vigna PRNG!) for seeding (by the way, I found another PRNG in the runtime just today, in skiplist.c). At present, for no good reason, we call xoshiro_init() from caml_memprof_start() as well as caml_memprof_domain_create(); we could just call it once per domain.
I see your point about reproducibility. That's going to be hard, as it always is in parallel systems. For programs in which different domains call Domain.spawn, the order in which different child domains arrive at caml_memprof_domain_create() may depend on kernel scheduling, so just having a mutex-protected counter won't give reproducibility. I'm open to suggestions.
We could get away without our own mutex, as at present caml_memprof_domain_create() is called with the all_domains_lock held. But I'd rather not depend on that.

@NickBarnes (Contributor, Author) commented

Using dom_st->id in the seed would be reproducible in many common cases in which domains start in a deterministic order. Maybe that would be good enough?

@stedolan (Contributor) left a review comment

This looks great so far. I have the usual pile of nitpicks, but nothing dramatic. (I haven't read the whole thing yet, so I'll have another hopefully smaller series of comments soon)

@stedolan (Contributor) commented

> Using dom_st->id in the seed would be reproducible in many common cases in which domains start in a deterministic order. Maybe that would be good enough?

Yeah, that seems fine. (If the domains don't start in a deterministic order, then the program does something nondeterministic and memprof isn't going to fix that)

@NickBarnes (Contributor, Author) commented

I'm just finishing off edits addressing @stedolan's review. Various people have also made positive remarks in other channels. I would appreciate a quick look from @jhjourdan.

@NickBarnes (Contributor, Author) commented

Regarding the suggestion by @gasche to use the PRNG in prng.c, I am reluctant to make this change in this PR because it would require changes to the interface of that module (to expose state-creation, with seed input), and it's getting away from the point of this PR. Maybe we should make an issue that there are too many PRNGs in the runtime (I know of at least four - although the two in skiplist.c and lf_skiplist.c are the same as each other - and maybe there are more).

@gasche (Member) commented Jan 25, 2024

If you were willing to suggest a prng.h addition that you would like, I would be happy to try to help by submitting this as an independent PR. If it gets merged fast enough then you can do the simplification on your side as well, otherwise we deal with this later.

(I don't understand the details of the performance and design constraints on the skiplist PRNG, so I would hesitate to propose replacing it by any of the other ones. On the other hand it seems pretty clear to me that statmemprof's SplitMix can be replaced by prng.c. And I am a bit wary of the argument of "someone else is already doing this suboptimal thing so it's fine", which can be a slippery slope.)

@NickBarnes (Contributor, Author) commented

Fair. I'll think about PRNG next Tuesday. Basically prng.c appears to manipulate RNG states (struct LXM_state) which are created and manipulated elsewhere (stdlib/random.ml, apparently, where they are Random.State.t values, which are bigarrays), and passed as value. memprof.c will need to create a PRNG state with a given seed, so we'll need code in prng.c to do that, and a prng.h defining it.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from a9f615f to e7190fd on January 25, 2024 18:04
@NickBarnes (Contributor, Author) commented

Fixed all @stedolan's points except the one about the complex loops in caml_memprof_track_young, which I will come back to on Tuesday 2024-01-30.

@NickBarnes (Contributor, Author) commented

A fairly basic prng.h interface could look like this:

enum {
    CAML_PRNG_L64X128, /* LXM family, Steele and Vigna */
    /* room for expansion to other PRNGs, which we could add or migrate to prng.c if needed */
};

typedef struct prng_state *prng_state_t;

prng_state_t caml_prng_create(uint64_t seed, /* a char* and a size_t might be better */
                              int variety /* a value from the enum above */);

uint64_t caml_prng_next(prng_state_t);

void caml_prng_destroy(prng_state_t state);

This gives us a bit of wiggle room for adding other PRNGs, or migrating them here from elsewhere, while otherwise keeping the implementation flexible by making the interface pretty opaque. If we wanted more wiggle room (e.g. for wider PRNGs), I guess we could pass a void* to caml_prng_next.

Of course the enum type is un-named and variety is an int, because this is C not C++.

@edwintorok (Contributor) commented Jan 25, 2024

if you want another PRNG to compare LXM against then wyrand appears to be one of the shortest ones with good statistical properties, at least according to this page (more portable version and paper). I only mention this because it is useful to have at least 2 implementations when testing/designing an API.

Be careful with initialization for a parallel PRNG. In "Pseudorandom Number Generation: Impossibility and Compromise" (Matsumoto, 2006) they argue that "many modern PRNGs have a defect in the initialization scheme. If the initial seeds are systematically chosen (for example, if the seeds are chosen to be 0, 100, 200, ...), then the output sequences are often strongly correlated."

According to Parallel Random Numbers: As Easy as 1, 2, 3 an API for a parallel RNG should have separate 'key' and 'counter' parameters, this offers more flexibility in choosing how to use the PRNG, either multistream (new key for each thread), or substream (partition counters between threads and use same key, and we have at most 128 domains, so partitioning might be doable).

Of course most RNGs aren't counter based, so this wouldn't directly apply here, but it might be useful to give 'prng_create' an additional 'OCaml domain id' as a parameter. If the RNG can't do anything useful with it, it is free to ignore it, but it might allow experimenting with different seeding mechanisms and different PRNGs for parallel programs more easily.
This is mentioned in the (Matsumoto 2006) paper as well "each subject consuming the random numbers is assigned a unique ID, and a PRNG characterized by the ID is assigned to each".

@NickBarnes (Contributor, Author) commented

While addressing the points raised by @stedolan, I noticed (once again) the following problem, which I think is outside the scope of this PR but is provoked by code here so should be fixed after this is merged; I'll raise it as an issue at that point.
With this PR, if there are pending memprof callbacks:

  • we set the action-pending flag;
  • caml_memprof_run_callbacks_exn gets called, which suspends, runs the callbacks, and unsuspends;
  • suspending and unsuspending both call caml_reset_young_limit, which sets the action-pending flag;
  • so we get back into caml_memprof_run_callbacks_exn, which happily notices that it has nothing to do.

So we get into memprof (and the pending-actions code in signals.c) twice rather than once. This is a general problem with caml_reset_young_limit. It's unclear to me that it should be unconditionally setting the action-pending flag, which is a rather separate concern.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from 2375436 to c7e3be1 on February 8, 2024 19:02
@gadmm (Contributor) commented Feb 8, 2024

The current PR might motivate making caml_reset_young_limit a bit more clever, and indeed it is better done separately.

@NickBarnes marked this pull request as draft on February 8, 2024 19:09
@NickBarnes (Contributor, Author) commented

Converted temporarily back to draft while I split/rebase/squash the most recent 4 commits. The tip of this branch does address all of @stedolan's review comments, but not those by @gasche as neither of us has yet reworked prng.c.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from c7e3be1 to 0a86f5f on February 8, 2024 19:19
@gasche (Member) commented Feb 8, 2024

I didn't mean my PRNG comment to block statmemprof merging. Please feel free to make review and merge decisions independently, we can always cleanup later if the PR gets merged in its current PRNG state.

@gadmm (Contributor) commented Feb 9, 2024

@NickBarnes your comment is in fact related to an old discussion we had about possibly reverting bed5988. See gadmm@285bd59b for a rebased reversion commit.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from 0a86f5f to ba0434a on February 10, 2024 23:41
@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from 4a1ab9d to 9757000 on March 15, 2024 18:38
@dra27 removed the run-thread-sanitizer label ("Makes the CI run the testsuite with TSAN enabled") on Mar 16, 2024
@dra27 (Member) commented Mar 16, 2024

My understanding is that the TSAN failure is not related to this PR and requires some reconfiguration in the CI workers - I've removed the label in order to "clear" the ugly ❌🧹

@kit-ty-kate (Member) left a review comment

this should fix the error on macOS

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from 9757000 to ca8b74c on March 17, 2024 14:47
@NickBarnes (Contributor, Author) commented

Thanks to @fabbing and @OlivierNicole for their help figuring out the problems in TSAN CI.

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from ca8b74c to a28eed8 on March 17, 2024 16:01
@shindere (Contributor) commented Mar 18, 2024 via email

@NickBarnes force-pushed the nick-11911-statmemprof-rebase branch from a28eed8 to 62d04fa on March 18, 2024 11:32
@stedolan merged commit e72d835 into ocaml:trunk on Mar 18, 2024
@gasche (Member) commented Mar 19, 2024

Congratulations on the merge!

Three minor comments on the Changes entry (I found it while rebasing #13013):

  1. I did not do a review in any realistic sense, so I should not be mentioned as a reviewer.
    (I did it for the original implementation and it was more than a full day of work, so I can tell the difference.)
  2. I would move this to the "Runtime system" section where most of the changes lie, even though of course this is exposed through the stdlib.
  3. It is customary to start the credit information in parentheses on a new line instead of wrapping it with the content.

I may move it around myself applying these guidelines.

sadiqj pushed a commit to sadiqj/ocaml that referenced this pull request Apr 24, 2024
* Import new Gc.Memprof API

* Import new statmemprof testsuite

* Runtime4 support for new statmemprof testsuite

* Port 12383: backtrace abstractions

* Import runtime5 statmemprof implementation

Ports PRs ocaml#12382, ocaml#12817, ocaml#12824, ocaml#12923, ocaml#13068

* Avoid allocating unnecessarily during bytecode callbacks

* Minor statmemprof testsuite robustness fixes
NickBarnes pushed a commit to NickBarnes/ocaml that referenced this pull request May 21, 2024
NickBarnes pushed a commit to NickBarnes/ocaml that referenced this pull request May 21, 2024
…-rebase

Statmemprof resurrected

(cherry picked from commit e72d835)
nmote pushed a commit to nmote/ocaml that referenced this pull request Oct 18, 2024
…1-statmemprof-rebase

Statmemprof resurrected
MisterDA pushed a commit to MisterDA/ocaml that referenced this pull request Feb 13, 2025
…-rebase

Statmemprof resurrected

(cherry picked from commit e72d835)
(cherry picked from commit 89992d7)
dra27 pushed a commit to ocaml-multicore/ocaml that referenced this pull request Feb 13, 2025
…-rebase

Statmemprof resurrected

(cherry picked from commit e72d835)