Ensure that old frame descriptor tables are freed #11673

Merged
Octachron merged 1 commit into ocaml:trunk from stedolan:free-old-frame-tables on Nov 8, 2022

Conversation

@stedolan
Contributor

This patch fixes the issue in #11662, where frame descriptor tables were not being freed.

Freeing an old frame descriptor table is more complicated in OCaml 5 than in OCaml 4, because at the point the new frame descriptor table is allocated the old one may still be in use by another domain. So, the actual freeing of memory is delayed until the end of the GC cycle, at which point it is certain to no longer be in use.

There were two bugs in the implementation of this:

  1. If the table is rebuilt several times within a single cycle, there may be more than one old table needing to be freed. While the logic for delaying freeing carefully constructed a linked list of old tables, the logic for actually freeing them ignored this list and just freed the first one, allowing the rest to leak (!)

  2. There was no pressure on the GC to speed up as a result of constructing many frame tables, so the memory usage could increase a lot before a GC cycle happened to complete and free the old tables.

This patch fixes (1) by actually freeing the tables in a loop, and fixes (2) by calling caml_adjust_gc_speed so that the GC speeds up to prevent stale frametables occupying huge amounts of space. This brings the memory usage of Frama-C startup down to vaguely reasonable levels (about 1.5x OCaml 4, instead of 10x).

There's a further optimisation that's not yet done here: now that OCaml 5 has synchronised parallel minor GCs, it is no longer necessary to wait for a full major GC cycle to be certain an old frametable is dead, as it will certainly not be in use after the next minor GC (which has a synchronisation point). So, it should be possible to wait only for the next minor GC instead of the next major one, which should also reduce time spent in otherwise-useless GC cycles in very plugin-heavy code.

@stedolan
Contributor Author

An obvious further optimisation just occurred to me: if there is only one domain (i.e. a single-threaded program), then you may as well free the old frametable immediately. That way, the costs of ensuring that parallel Dynlink is safe are paid only by programs that actually do parallel Dynlinking.

    - caml_stat_free(ft->prev->table.descriptors);
    - caml_stat_free(ft->prev);
    + struct frametable_version *p = ft->prev;
    + while (p != NULL) {
Member


Would it make sense to use ft->prev directly rather than introduce p? When I was looking at the issue, I had the impression that it could be nice to directly use ft->prev = NULL as the loop post-condition.

Contributor Author


We need to keep ft as it is because of the write to ft->free_prev_after_cycle at the end.

Member


I certainly agree with keeping ft, but I am missing the interaction between the writes to ft->prev and the write to ft->free_prev_after_cycle.

      p = next;
    }
    ft->prev = NULL;
    atomic_store_rel(&ft->free_prev_after_cycle, No_need_to_free);
Member


When reading this part of the code, I was wondering whether we want to move this write before the cleaning loop. Since the prev field is not part of the public header and is only modified while holding descr_mutex, it seems fine to me to signal early that there is no cleaning work left to do?

Member

@gasche gasche left a comment


I agree that the bugfix is correct.

There is no obvious way to adapt the code to release old tables after the next minor GC (rather than the next major GC), because we don't keep a global count of the number of minor GCs, only per-domain counts. (We could keep a global count.)

The code seems to be careful to free the old frame descriptors lazily, only upon access to the new values. Is this an important optimization, compared to adding a call to try to free them at the end of each GC cycle?

@Octachron Octachron added this to the 5.0 milestone Oct 27, 2022
@gadmm
Contributor

gadmm commented Oct 27, 2022

Naive question, since this looks like GC-ing the tables, have you considered reference-counting the tables and dropping them as soon as they are no longer in use? This would avoid having to build a linked list of old tables altogether. I am missing why the ownership would be hard to track.

@Octachron
Member

I also agree with the change (independently of my naive questions); however, as release manager I suggest adding a Changes entry: this is the kind of usability change that matters when announcing the rc or beta releases, and I like to be able to point to a change entry in those cases.

@stedolan
Contributor Author

Naive question, since this looks like GC-ing the tables, have you considered reference-counting the tables and dropping them as soon as they are no longer in use?

The trouble is that refcounting causes contention, because multiple concurrent readers all synchronise on the shared reference count - that is, you pay synchronisation overheads even in the absence of any writers. (Maybe we can ensure that this structure isn't accessed too often and the overhead is manageable; I'm not sure.)

@stedolan
Contributor Author

The code seems to be careful to free the old frame descriptors lazily, only upon access to the new values. Is this an important optimization, compared to adding a call to try to free them at the end of each GC cycle?

No, it's not important. That approach makes sense to me!

@stedolan
Contributor Author

I had a go at implementing the single-threaded optimisation, but it doesn't speed up the plugin-heavy Frama-C startup all that much. The trouble is that the OCaml 5 dynlink logic is fundamentally O(n^2) as it rebuilds the entire frame descriptor table each time, while the OCaml 4 version is O(n) as it mutates the existing one in-place.

(When writing this logic some years ago, I thought it would rebuild the table once per loaded .cmxs, which is annoying but bearable. It's much worse: it rebuilds the table once per module contained in each loaded .cmxs, because it needs a populated table between the initialisation of any two modules. There can be a lot of those)

I think the right fix is probably to think seriously about the frame descriptor table as a mutable concurrent hashtable, or to globally synchronise all accesses to the frametable. Both of these are things I was really hoping not to have to do (because apart from Dynlink, they're not necessary). That fix may have to wait for 5.1, though, as it seems a bit late in the release cycle to add a concurrent hashtable to the runtime! Something along the lines of what's here will have to do for 5.0, but I'll think about it a bit more to see if I can find an easy fix.

@gasche
Member

gasche commented Oct 31, 2022

For read-heavy write-rarely data structures, we can implement synchronization by ensuring that modifying the structure provokes a stop-the-world event. Then we can read from any STW-registered mutator without any synchronization on reads. This seems reasonably easy/uninvasive to implement (famous last words...), provided the modifications don't occur too deeply inside the runtime code.

@Octachron
Member

Any objections to merging this PR and moving the discussion of a better implementation to an issue? (I will push a change entry manually myself.)

@gasche
Member

gasche commented Nov 7, 2022

We are discussing two issues in this PR:

  • a memory leak
  • a time-complexity regression compared to 4.x

The PR itself fixes only the memory leak, has been reviewed, and should indeed be merged asap (ideally in 5.0).

The time issue is less critical. We are talking about N=200 modules being loaded; a memory leak proportional to N * 2 MiB is obvious to our users, whereas an N^2 time cost is less of a concern in practice.

@Octachron Octachron merged commit b06f2d1 into ocaml:trunk Nov 8, 2022
Octachron added a commit that referenced this pull request Nov 8, 2022
Ensure that old frame descriptor tables are freed

(cherry picked from commit b06f2d1)
@gasche
Member

gasche commented Jan 22, 2023

It looks like the "time issue" is also causing a noticeable regression for Coq which also uses natdynlink heavily for its plugins -- see user report in #11926.

Currently, if I understand correctly, the code is doing as follows:

  • we maintain a global list of all "frametable fragments" that have been "registered" by the dynlink machinery
  • from a given state of the list we can compute a "frametable version", by linearly building a hash table from the list elements
  • when we compute a frametable version, we cache it by storing it in the corresponding node of the "frametable fragments" list
  • there is a "current frametable" which points to the latest frametable version, but mutators do not use atomic accesses to read the frametable (for efficiency reasons?) so they may access older frametable versions
  • mutators "update" their view of the "current frametable" at least once per major cycle; when the current frametable is more than one major cycle old, we free all previous frametable versions

(We don't want to (1) require locking when accessing the frametable, which is a performance-critical operation, or (2) try to mutate a hash table while mutators could access it, which requires a complex concurrent hashtable implementation -- or locking on access, see (1).)

My proposed strategy would be as follows:

  • we maintain a single global "frame descriptor" hash table, accessed without synchronization
  • whenever we want to update the table to register a new fragment, we perform a stop-the-world to ensure that all mutators are stopped

@gasche
Member

gasche commented Jan 22, 2023

I think that it's a bit of work to implement and test, but not that much either. (We did something similar to resize the minor heap and it works okay.) The resulting code would in fact probably be simpler than the current one.

@gadmm
Contributor

gadmm commented Jan 22, 2023

Unfortunately a stop-the-world synchronization cannot remove the quadratic behaviour since the iteration responsible for the quadratic blowup is located in OCaml code. Instead of reallocating the frame descriptor table quadratically, you stop the world quadratically. In legacy single-threaded code this will show a speed-up, but the benefits in the long term are less clear (except for space).

I sketched another solution but I was not confident this was the right place to discuss the time complexity problem, so I replied at #11926 instead.

@gasche
Member

gasche commented Jan 22, 2023

@gadmm I think that the quadratic behavior comes from the runtime code.

If we call caml_register_frametable N times:

    void caml_register_frametable(intnat *table)
    {
      struct frametable_version *ft, *old;

      caml_plat_lock(&descr_mutex);

      frametables = cons(table, frametables);
      old = (struct frametable_version*)atomic_load_acq(&current_frametable);
      CAMLassert(old != NULL);
      ft = caml_stat_alloc(sizeof(*ft));
      ft->table = build_frame_descriptors(frametables);
      atomic_store_rel(&ft->free_prev_after_cycle, caml_major_cycles_completed);
      ft->prev = old;
      atomic_store_rel(&current_frametable, (uintnat)ft);
      /* Ensure that we GC often enough to prevent more than 1/4 of
         heap memory being stale frame tables */
      caml_adjust_gc_speed(
        /* Size of the table just allocated */
        (sizeof(*ft) + sizeof(ft->table.descriptors[0]) * (ft->table.mask + 1)),
        /* 1/4 of the heap size */
        caml_heap_size(Caml_state->shared_heap) / 4
      );
      caml_plat_unlock(&descr_mutex);
    }

then build_frame_descriptors is called N times on the current list of frametable fragments:

    static caml_frame_descrs build_frame_descriptors(link* frametables)
    {

and each call on a list of length K<=N iterates on each element to build a hash table:

    /* Fill the hash table */
    iter_list(frametables, lnk) {
      tbl = lnk->frametable;
      len = *tbl;
      d = (frame_descr *)(tbl + 1);
      for (j = 0; j < len; j++) {
        h = Hash_retaddr(d->retaddr, tblsize - 1);
        while (table.descriptors[h] != NULL) {
          h = (h+1) & table.mask;
        }
        table.descriptors[h] = d;
        if (j != len - 1) {
          d = next_frame_descr(d);
        }
      }
    }

Am I missing something?

@gasche
Member

gasche commented Jan 22, 2023

Instead with my implementation strategy we would have N stop-the-world pauses where each one "grows" the global hash table by adding the content of the new frametable fragment.

@gadmm
Contributor

gadmm commented Jan 22, 2023

Right, sorry, I did not mean to write that you stop the world quadratically (though I did), this would be even worse.

What I meant is that the iteration responsible for calling caml_register_frametable a lot seems to be here:

    List.iter (fun cu ->
        try ndl_run handle cu
        with exn ->
          Printexc.raise_with_backtrace
            (DT.Error (Library's_module_initializers_failed exn))
            (Printexc.get_raw_backtrace ()))
      (Unit_header.defined_symbols unit_header)

and it is not possible to place the stop-the-world section around all of this. So you can only replace each eventually-expensive call to build_frame_descriptors with an expensive call to stop the world. This is no longer N² but some N·M, where: 1) M is very large (except in single-domain mode), and 2) M is at least linear in the number of domains (if you count time across all domains). But I did not mean to make a formal complexity claim, only that the trade-off is not obvious.

I am not dismissing your solution, but I prefer one that makes caml_register_frametable inexpensive (it being understood that the point of both solutions is to keep the cost low for the reader).

@xavierleroy
Contributor

xavierleroy commented Jan 22, 2023

We don't want to (1) require locking when accessing the frametable, which is a performance-critical operation

What about a reader-writer lock? GC root scanning and backtrace construction would lock for reading, and only dynlink would lock for writing.

@gadmm
Contributor

gadmm commented Jan 22, 2023

@stedolan was worried that the overheads of reference-counting would be too much for dealing with memory reclamation, so there would be the same (bigger?) worries for a reader-writer lock. But I am not convinced this would be too costly, I think this is a solution worth trying.

@gasche
Member

gasche commented Jan 22, 2023

@gadmm thanks for your clarification.

it is not possible to place the stop-the-world section around [the iteration on all symbols]

Maybe we could move to a different API where we first "prepare" a bunch of frametables and then add them all at the same time (with a single pause). But that does make this approach more invasive.

I agree that trying the reader-writer lock first is a good choice.

@stedolan
Contributor Author

What about a reader-writer lock? GC root scanning and backtrace construction would lock for reading, and only dynlink would lock for writing.

Reader-writer locks cause contention even between readers, since the lock state must be atomically updated whenever a reader enters or exits the critical section. It's possible that we could find a coarse enough locking strategy for this to work OK, but just wrapping every use of the frametable won't work. (Maybe just a big reader critical section around GC pauses and backtrace collection is fine? However, backtrace collection can happen very frequently)

Maybe we could move to a different API where we first "prepare" a bunch of frametables and then add them all at the same time (with a single pause). But that does make this approach more invasive.

I think this is easy and worth doing in any case, I've opened #11935 to that effect.

@gadmm
Contributor

gadmm commented Jan 25, 2023

My intuition was that backtraces could be collected frequently, but not that frequently. I had in mind that the frametable is typically cold cache-wise, in which case the cost of an atomic integer update is negligible compared to the other memory accesses. (The case where catchable exceptions are raised very often sounds like a case where you want to disable backtraces anyway.)

Still if you think the makeshift lock-free hash table could be useful despite this consideration and your separate improvement at #11935, let me know.
