Ensure that old frame descriptor tables are freed #11673

Merged
Octachron merged 1 commit into ocaml:trunk from stedolan:free-old-frame-tables on Nov 8, 2022

Conversation

@stedolan
Contributor

This patch fixes the issue in #11662, where frame descriptor tables were not being freed.

Freeing an old frame descriptor table is more complicated in OCaml 5 than in OCaml 4, because at the point the new frame descriptor table is allocated the old one may still be in use by another domain. So, the actual freeing of memory is delayed until the end of the GC cycle, at which point it is certain to no longer be in use.

There were two bugs in the implementation of this:

  1. If the table is rebuilt several times within a single cycle, there may be more than one old table needing to be freed. While the logic for delaying freeing carefully constructed a linked list of old tables, the logic for actually freeing them ignored this list and just freed the first one, allowing the rest to leak (!)

  2. There was no pressure on the GC to speed up as a result of constructing many frame tables, so the memory usage could increase a lot before a GC cycle happened to complete and free the old tables.

This patch fixes (1) by actually freeing the tables in a loop, and fixes (2) by calling caml_adjust_gc_speed so that the GC speeds up to prevent stale frametables occupying huge amounts of space. This brings the memory usage of Frama-C startup down to vaguely reasonable levels (about 1.5x OCaml 4, instead of 10x).

There's a further optimisation that's not yet done here: now that OCaml 5 has synchronised parallel minor GCs, it is no longer necessary to wait for a full major GC cycle to be certain an old frametable is dead, as it will certainly not be in use after the next minor GC (which has a synchronisation point). So, it should be possible to wait only for the next minor GC instead of the next major one, which should also reduce time spent in otherwise-useless GC cycles in very plugin-heavy code.

@stedolan
Contributor Author

An obvious further optimisation just occurred to me: if there is only one domain (i.e. a single-threaded program), then you may as well free the old frametable immediately. That way, the costs of ensuring that parallel Dynlink is safe are paid only by programs that actually do parallel Dynlinking.

    - caml_stat_free(ft->prev->table.descriptors);
    - caml_stat_free(ft->prev);
    + struct frametable_version *p = ft->prev;
    + while (p != NULL) {
Member


Would it make sense to use ft->prev directly rather than introduce p? When I was looking at the issue, I had the impression that it could be nice to directly use ft->prev = NULL as the loop post-condition.

Contributor Author


We need to keep ft as it is because of the write to ft->free_prev_after_cycle at the end.

Member


I certainly agree with keeping ft, but I am missing the interaction between the writes to ft->prev and the write to ft->free_prev_after_cycle.

      p = next;
    }
    ft->prev = NULL;
    atomic_store_rel(&ft->free_prev_after_cycle, No_need_to_free);
Member


When reading this part of the code, I was wondering whether we want to move this write before the cleaning loop. Since the prev field is not part of the public header and is only modified while holding descr_mutex, it seems fine to me to signal early that there is no cleaning work left to do?

Member

@gasche gasche left a comment


I agree that the bugfix is correct.

There is no obvious way to adapt the code to release old tables after the next minor GC (rather than the next major GC), because we don't keep a global count of the number of minor GCs, only per-domain counts. (We could keep a global count.)

The code seems to be careful to free the old frame descriptors lazily, only upon access to the new values. Is this an important optimization, compared to adding a call to try to free them at the end of each GC cycle?

@Octachron Octachron added this to the 5.0 milestone Oct 27, 2022
@gadmm
Contributor

gadmm commented Oct 27, 2022

Naive question, since this looks like GC-ing the tables, have you considered reference-counting the tables and dropping them as soon as they are no longer in use? This would avoid having to build a linked list of old tables altogether. I am missing why the ownership would be hard to track.

@Octachron
Member

I also agree with the change (independently of my naive questions); however, as release manager I suggest adding a Changes entry: this is the kind of usability change that matters when announcing the rc or beta releases, and I like to be able to point to a change entry in those cases.

@stedolan
Contributor Author

Naive question, since this looks like GC-ing the tables, have you considered reference-counting the tables and dropping them as soon as they are no longer in use?

The trouble is that refcounting causes contention, because multiple concurrent readers all synchronise on the shared reference count - that is, you pay synchronisation overheads even in the absence of any writers. (Maybe we can ensure that this structure isn't accessed too often and the overhead is manageable; I'm not sure.)

@stedolan
Contributor Author

The code seems to be careful to free the old frame descriptors lazily, only upon access to the new values. Is this an important optimization, compared to adding a call to try to free them at the end of each GC cycle?

No, it's not important. That approach makes sense to me!

@stedolan
Contributor Author

I had a go at implementing the single-threaded optimisation, but it doesn't speed up the plugin-heavy Frama-C startup all that much. The trouble is that the OCaml 5 dynlink logic is fundamentally O(n^2) as it rebuilds the entire frame descriptor table each time, while the OCaml 4 version is O(n) as it mutates the existing one in-place.

(When writing this logic some years ago, I thought it would rebuild the table once per loaded .cmxs, which is annoying but bearable. It's much worse: it rebuilds the table once per module contained in each loaded .cmxs, because it needs a populated table between the initialisation of any two modules. There can be a lot of those)

I think the right fix is probably to think seriously about the frame descriptor table as a mutable concurrent hashtable, or to globally synchronise all accesses to the frametable. Both of these are things I was really hoping not to have to do (because apart from Dynlink, they're not necessary). That fix may have to wait for 5.1, though, as it seems a bit late in the release cycle to add a concurrent hashtable to the runtime! Something along the lines of what's here will have to do for 5.0, but I'll think about it a bit more to see if I can find an easy fix.

@gasche
Member

gasche commented Oct 31, 2022

For read-heavy write-rarely data structures, we can implement synchronization by ensuring that modifying the structure provokes a stop-the-world event. Then we can read from any STW-registered mutator without any synchronization on reads. This seems reasonably easy/uninvasive to implement (famous last words...), provided the modifications don't occur too deeply inside the runtime code.

@Octachron
Member

Any objections to merging this PR and moving the discussion of a better implementation to an issue? (I will push a change entry manually myself.)

@gasche
Member

gasche commented Nov 7, 2022

We are discussing two issues in this PR:

  • a memory leak
  • a time-complexity regression compared to 4.x

The PR itself fixes only the memory leak, has been reviewed, and should indeed be merged asap (ideally in 5.0).

The time issue is less critical. We are talking about N=200 modules being loaded; a memory leak proportional to N * 2 MiB is obvious to our users, whereas an N^2 time cost is less of a concern in practice.

@Octachron Octachron merged commit b06f2d1 into ocaml:trunk Nov 8, 2022
Octachron added a commit that referenced this pull request Nov 8, 2022
Ensure that old frame descriptor tables are freed

(cherry picked from commit b06f2d1)
@gasche
Member

gasche commented Jan 22, 2023

It looks like the "time issue" is also causing a noticeable regression for Coq which also uses natdynlink heavily for its plugins -- see user report in #11926.

Currently, if I understand correctly, the code is doing as follows:

  • we maintain a global list of all "frametable fragments" that have been "registered" by the dynlink machinery
  • from a given state of the list we can compute a "frametable version", by linearly building a hash table from the list elements
  • when we compute a frametable version, we cache it by storing it in the corresponding node of the "frametable fragments" list
  • there is a "current frametable" which points to the latest frametable version, but mutators do not use atomic accesses to read the frametable (for efficiency reasons?) so they may access older frametable versions
  • mutators "update" their view of the "current frametable" at least once per major cycle; when the current frametable is more than one major cycle old, we free all previous frametable versions

(We don't want to (1) require locking when accessing the frametable, which is a performance-critical operation, or (2) try to mutate a hash table while mutators could access it, which requires a complex concurrent hashtable implementation -- or locking on access, see (1).)

My proposed strategy would be as follows:

  • we maintain a single global "frame descriptor" hash table, accessed without synchronization
  • whenever we want to update the table to register a new fragment, we perform a stop-the-world to ensure that all mutators are stopped

@gasche
Member

gasche commented Jan 22, 2023

I think that it's a bit of work to implement and test, but not that much either. (We did something similar to resize the minor heap and it works okay.) The resulting code would in fact probably be simpler than the current one.

@gadmm
Contributor

gadmm commented Jan 22, 2023

Unfortunately a stop-the-world synchronization cannot remove the quadratic behaviour since the iteration responsible for the quadratic blowup is located in OCaml code. Instead of reallocating the frame descriptor table quadratically, you stop the world quadratically. In legacy single-threaded code this will show a speed-up, but the benefits in the long term are less clear (except for space).

I sketched another solution but I was not confident this was the right place to discuss the time complexity problem, so I replied at #11926 instead.

@gasche
Member

gasche commented Jan 22, 2023

@gadmm I think that the quadratic behavior comes from the runtime code.

If we call caml_register_frametable N times:

    void caml_register_frametable(intnat *table)
    {
      struct frametable_version *ft, *old;

      caml_plat_lock(&descr_mutex);

      frametables = cons(table, frametables);
      old = (struct frametable_version*)atomic_load_acq(&current_frametable);
      CAMLassert(old != NULL);
      ft = caml_stat_alloc(sizeof(*ft));
      ft->table = build_frame_descriptors(frametables);
      atomic_store_rel(&ft->free_prev_after_cycle, caml_major_cycles_completed);
      ft->prev = old;
      atomic_store_rel(&current_frametable, (uintnat)ft);
      /* Ensure that we GC often enough to prevent more than 1/4 of
         heap memory being stale frame tables */
      caml_adjust_gc_speed(
        /* Size of the table just allocated */
        (sizeof(*ft) + sizeof(ft->table.descriptors[0]) * (ft->table.mask + 1)),
        /* 1/4 of the heap size */
        caml_heap_size(Caml_state->shared_heap) / 4
      );
      caml_plat_unlock(&descr_mutex);
    }

then build_frame_descriptors is called N times on the current list of frametable fragments:

    static caml_frame_descrs build_frame_descriptors(link* frametables)
    {

and each call on a list of length K<=N iterates on each element to build a hash table:

    /* Fill the hash table */
    iter_list(frametables, lnk) {
      tbl = lnk->frametable;
      len = *tbl;
      d = (frame_descr *)(tbl + 1);
      for (j = 0; j < len; j++) {
        h = Hash_retaddr(d->retaddr, tblsize - 1);
        while (table.descriptors[h] != NULL) {
          h = (h+1) & table.mask;
        }
        table.descriptors[h] = d;
        if (j != len - 1) {
          d = next_frame_descr(d);
        }
      }
    }

Am I missing something?

@gasche
Member

gasche commented Jan 22, 2023

Instead with my implementation strategy we would have N stop-the-world pauses where each one "grows" the global hash table by adding the content of the new frametable fragment.

@gadmm
Contributor

gadmm commented Jan 22, 2023

Right, sorry, I did not mean to write that you stop the world quadratically (though I did), this would be even worse.

What I meant is that the iteration responsible for calling caml_register_frametable a lot seems to be here:

    List.iter (fun cu ->
        try ndl_run handle cu
        with exn ->
          Printexc.raise_with_backtrace
            (DT.Error (Library's_module_initializers_failed exn))
            (Printexc.get_raw_backtrace ()))
      (Unit_header.defined_symbols unit_header)

and it is not possible to place the stop-the-world section around all of this. So you can only replace each eventually-expensive call to build_frame_descriptors with an expensive call to stop the world. This is no longer N² but some N·M, where: 1) M is very large (except in single-domain mode), and 2) M is at least linear in the number of domains (if you count time across all domains). But I did not mean to make a formal complexity claim, only that the trade-off is not obvious.

I am not dismissing your solution, but I prefer one that makes caml_register_frametable inexpensive (it being understood that the point of both solutions is to keep the cost low for the reader).

@xavierleroy
Contributor

xavierleroy commented Jan 22, 2023

We don't want to (1) require locking when accessing the frametable, which is a performance-critical operation

What about a reader-writer lock? GC root scanning and backtrace construction would lock for reading, and only dynlink would lock for writing.

@gadmm
Contributor

gadmm commented Jan 22, 2023

@stedolan was worried that the overheads of reference-counting would be too much for dealing with memory reclamation, so there would be the same (bigger?) worries for a reader-writer lock. But I am not convinced this would be too costly, I think this is a solution worth trying.

@gasche
Member

gasche commented Jan 22, 2023

@gadmm thanks for your clarification.

it is not possible to place the stop-the-world section around [the iteration on all symbols]

Maybe we could move to a different API where we first "prepare" a bunch of frametables and then add them all at the same time (with a single pause). But that does make this approach more invasive.

I agree that trying the reader-writer lock first is a good choice.

@stedolan
Contributor Author

What about a reader-writer lock? GC root scanning and backtrace construction would lock for reading, and only dynlink would lock for writing.

Reader-writer locks cause contention even between readers, since the lock state must be atomically updated whenever a reader enters or exits the critical section. It's possible that we could find a coarse enough locking strategy for this to work OK, but just wrapping every use of the frametable won't work. (Maybe just a big reader critical section around GC pauses and backtrace collection is fine? However, backtrace collection can happen very frequently)

Maybe we could move to a different API where we first "prepare" a bunch of frametables and then add them all at the same time (with a single pause). But that does make this approach more invasive.

I think this is easy and worth doing in any case, I've opened #11935 to that effect.

@gadmm
Contributor

gadmm commented Jan 25, 2023

My intuition was that backtraces could be collected frequently, but not that frequently. I had in mind that the frametable is typically cold cache-wise, in which case the cost of an atomic integer update is negligible compared to the other memory accesses. (The case where catchable exceptions are raised very often sounds like a case where you want to disable backtraces anyway.)

Still if you think the makeshift lock-free hash table could be useful despite this consideration and your separate improvement at #11935, let me know.
