Ensure that old frame descriptor tables are freed #11673
Octachron merged 1 commit into ocaml:trunk
Conversation
An obvious further optimisation just occurred to me: if there is only one domain (i.e. a single-threaded program), then you may as well free the old frametable immediately. That way, the costs of ensuring that parallel Dynlink is safe are paid only by programs that actually do parallel Dynlinking.
```diff
-  caml_stat_free(ft->prev->table.descriptors);
-  caml_stat_free(ft->prev);
+  struct frametable_version *p = ft->prev;
+  while (p != NULL) {
```
Would it make sense to use ft->prev directly rather than introduce p? When I was looking at the issue, I had the impression that it could be nice to directly use ft->prev = NULL as the loop post-condition.
We need to keep ft as it is because of the write to ft->free_prev_after_cycle at the end.
I certainly agree with keeping ft, but I am missing the interaction between the writes to ft->prev and the write to ft->free_prev_after_cycle.
```c
    p = next;
  }
  ft->prev = NULL;
  atomic_store_rel(&ft->free_prev_after_cycle, No_need_to_free);
```
When reading this part of the code, I was wondering if we want to move this write before the cleaning loop. Since the prev field is not part of the public header and is only modified while holding the descr_mutex, it seems fine to me to signal early that there is no cleaning work left to do.
gasche left a comment
I agree that the bugfix is correct.
There is no obvious way to adapt the code to release old tables after the next minor GC (rather than the next major GC), because we don't keep a global count of the number of minor GCs, only per-domain counts. (We could keep a global count.)
The code seems to be careful to free the old frame descriptors lazily, only upon access to the new values. Is this an important optimization, compared to adding a call to try to free them at the end of each GC cycle?
Naive question, since this looks like GC-ing the tables, have you considered reference-counting the tables and dropping them as soon as they are no longer in use? This would avoid having to build a linked list of old tables altogether. I am missing why the ownership would be hard to track.
I also agree with the change (independently of my naive questions); however, as release manager I am suggesting to add a Changes entry: this is the kind of usability change that matters when announcing the rc or beta releases, and I like to be able to point to a change entry in those cases.
The trouble is that refcounting causes contention because multiple concurrent readers are all synchronising on the shared reference count - that is, you pay synchronisation overheads even in the absence of any writers. (Maybe we can ensure that this structure isn't accessed too often and the overhead is manageable, I'm not sure.)
No, it's not important. That approach makes sense to me!
I had a go at implementing the single-threaded optimisation, but it doesn't speed up the plugin-heavy Frama-C startup all that much. The trouble is that the OCaml 5 dynlink logic is fundamentally O(n^2), as it rebuilds the entire frame descriptor table each time, while the OCaml 4 version is O(n), as it mutates the existing one in place. (When writing this logic some years ago, I thought it would rebuild the table once per loaded …)

I think the right fix is probably to think seriously about the frame descriptor table as a mutable concurrent hashtable, or to globally synchronise all accesses to the frametable. Both of these are things I was really hoping not to have to do (because apart from Dynlink, they're not necessary).

That fix may have to wait for 5.1, though, as it seems a bit late in the release cycle to add a concurrent hashtable to the runtime! Something along the lines of what's here will have to do for 5.0, but I'll think about it a bit more to see if I can find an easy fix.
For read-heavy write-rarely data structures, we can implement synchronization by ensuring that modifying the structure provokes a stop-the-world event. Then we can read from any STW-registered mutator without any synchronization on reads. This seems reasonably easy/uninvasive to implement (famous last words...), provided the modifications don't occur too deeply inside the runtime code.
Any objections to merging this PR and eventually moving the discussion on a better implementation to an issue (I will push a change entry manually myself)?
We are discussing two issues in this PR: the memory leak and the quadratic time behaviour.
The PR itself fixes only the memory leak, has been reviewed, and should indeed be merged asap (ideally in 5.0). The time issue is less critical: we are talking about N=200 modules being loaded, and a memory leak proportional to N × 2 MiB is obvious to our users, whereas an N^2 time leak is less of a concern in practice.
Ensure that old frame descriptor tables are freed (cherry picked from commit b06f2d1)
It looks like the "time issue" is also causing a noticeable regression for Coq, which also uses natdynlink heavily for its plugins -- see the user report in #11926. Currently, if I understand correctly, the code does the following:
(We don't want to (1) require locking when accessing the frametable, which is a performance-critical operation, or (2) try to mutate a hash table while mutators could access it, which requires a complex concurrent hashtable implementation -- or locking on access, see (1).) My proposed strategy would be as follows:
I think that it's a bit of work to implement and test, but not that much either. (We did something similar to resize the minor heap and it works okay.) The resulting code would in fact probably be simpler than the current one.
Unfortunately a stop-the-world synchronization cannot remove the quadratic behaviour since the iteration responsible for the quadratic blowup is located in OCaml code. Instead of reallocating the frame descriptor table quadratically, you stop the world quadratically. In legacy single-threaded code this will show a speed-up, but the benefits in the long term are less clear (except for space). I sketched another solution but I was not confident this was the right place to discuss the time complexity problem, so I replied at #11926 instead.
@gadmm I think that the quadratic behavior comes from the runtime code. If we call [runtime/frame_descriptors.c, lines 161 to 186 at eb04c8b], then [runtime/frame_descriptors.c, lines 71 to 72 at eb04c8b], and each call on a list of length K <= N iterates on each element to build a hash table [runtime/frame_descriptors.c, lines 96 to 111 at eb04c8b]. Am I missing something?
Instead with my implementation strategy we would have N stop-the-world pauses where each one "grows" the global hash table by adding the content of the new frametable fragment.
Right, sorry, I did not mean to write that you stop the world quadratically (though I did); this would be even worse. What I meant is that the iteration responsible for these calls is in OCaml code [otherlibs/dynlink/native/dynlink.ml, lines 83 to 89 at eb04c8b], and it is not possible to place the stop-the-world section around all of this. So you can only replace each eventually-expensive call to build_frame_descriptor by an expensive call to stop the world. This is no longer N^2 but some NM, where: 1) M is very large (except in single-domain mode), 2) M is at least linear in the number of domains (if you count time across all domains). But I did not mean to make a formal complexity claim, only that the trade-off is not obvious. I do not discard your solution, but I prefer one that makes …
What about a reader-writer lock? GC root scanning and backtrace construction would lock for reading, and only dynlink would lock for writing. |
@stedolan was worried that the overheads of reference-counting would be too much for dealing with memory reclamation, so there would be the same (bigger?) worries for a reader-writer lock. But I am not convinced this would be too costly; I think this is a solution worth trying.
@gadmm thanks for your clarification.
Maybe we could move to a different API where we first "prepare" a bunch of frametables and then add them all at the same time (with a single pause). But that does make this approach more invasive. I agree that trying the reader-writer lock first is a good choice.
Reader-writer locks cause contention even between readers, since the lock state must be atomically updated whenever a reader enters or exits the critical section. It's possible that we could find a coarse enough locking strategy for this to work OK, but just wrapping every use of the frametable won't work. (Maybe just a big reader critical section around GC pauses and backtrace collection is fine? However, backtrace collection can happen very frequently)
I think this is easy and worth doing in any case; I've opened #11935 to that effect.
My intuition was that backtraces could be collected frequently, but not that frequently. I had in mind that the frametable was typically cold cache-wise, in which case the cost of an integer atomic update is negligible compared to other memory accesses. (The case where catchable exceptions are raised very often sounds like a case where you want to disable backtraces anyway.) Still, if you think the makeshift lock-free hash table could be useful despite this consideration and your separate improvement at #11935, let me know.
This patch fixes the issue in #11662, where frame descriptor tables were not being freed.
Freeing an old frame descriptor table is more complicated in OCaml 5 than in OCaml 4, because at the point the new frame descriptor table is allocated, the old one may still be in use by another domain. So, the actual freeing of memory is delayed until the end of the GC cycle, at which point it is certain to no longer be in use.
There were two bugs in the implementation of this:

1. If the table is rebuilt several times within a single cycle, there may be more than one old table needing to be freed. While the logic for delaying freeing carefully constructed a linked list of old tables, the logic for actually freeing them ignored this list and just freed the first one, allowing the rest to leak (!)

2. There was no pressure on the GC to speed up as a result of constructing many frame tables, so the memory usage could increase a lot before a GC cycle happened to complete and free the old tables.
This patch fixes (1) by actually freeing the tables in a loop, and fixes (2) by calling caml_adjust_gc_speed so that the GC speeds up to prevent stale frametables occupying huge amounts of space. This brings the memory usage of Frama-C startup down to vaguely reasonable levels (about 1.5x OCaml 4, instead of 10x).

There's a further optimisation that's not yet done here: now that OCaml 5 has synchronised parallel minor GCs, it is no longer necessary to wait for a full major GC cycle to be certain an old frametable is dead, as it will certainly not be in use after the next minor GC (which has a synchronisation point). So, it should be possible to wait only for the next minor GC instead of the next major one, which should also reduce time spent in otherwise-useless GC cycles in very plugin-heavy code.