Optimize reallocation of caml_frame_descriptors.#156
Optimize reallocation of caml_frame_descriptors.#156chambart wants to merge 5 commits intoocaml:trunkfrom
Conversation
|
+1 for fixing this issue I've noticed the bad behavior when I was debugging a surprisingly long startup time of one of our programs at Jane Street (although this particular thing wasn't the main culprit). The caml_frame_descriptors table can be really big when you use big libraries (like Core or Async, they have many functions). Rehashing this whole table every time a small module is dynlinked can be quite expensive, not to mention quadratic behavior when you load many modules. |
|
That's impressive! We remarked in frama-c that the statup was slow but we never imagined that was because of this. Thanks a lot for finding it and for fixing it! I tried to review the code, but I had trouble to understand the different states of the book keeping of the hashtable. So here an attempt to explicit the different states: bobot/ocaml@878819d, bobot/ocaml@e65f499 (branch optimizes_caml_init_frame_descriptor) . Feel free to use them (or not). Except that, the code is neat and a lot more efficient! |
|
Yes, it's a good idea to grow the hash table only when needed. I wonder, however, if the code couldn't be simplified further by having caml_register_frametable enter the frames directly in the hash table,, rather than maintaining a list of frame tables. The resizing and insertion logic would go into caml_register_frametable. In the resize case, the contents of the old hash table are reinserted in the new, bigger one. Also, it would be nice to be able to unregister a frametable, removing its entries from the hash table. We can't use this (safely) for natdynlink (yet?), but the Coqonut project (a JIT for Coq) could make use of it. |
The difference is startup time for very-short-lived programs that don't allocate enough to start a major collection before they exit. Not sure we should care about those. |
|
Even minor collections need the frame table to be properly set up. So, 2015-03-24 20:55 GMT+01:00 Damien Doligez notifications@github.com:
|
|
@xavierleroy I'll try that simplification. Maybe one way to have both the lazy behavior for non-allocating programs and a simpler situation would be to continue to do the first initialisation lazily but directly update the table on dynlink. For code unloading, it's probably possible to lazily reconstruct the table when the used part of the table is below 1/8 (to avoid reallocating if the size oscillate around a power of 2). But I wouldn't want to try that if there is no code using it yet. |
|
I don't think you need to bother with shrinking the table when unloading code, but don't you need to rebuild the table whenever you remove some entries? And I don't think we really care about lazy initialization. |
It now only reallocate and reinitialise the whole table only when it is too small. This avoids quadratic behavior when loading a lot of module with dynlink.
If some code was unloaded, the frame table can be freed, so we cannot dereference the frame_descr pointer in the table. To avoid that problem, the return address is copied to the frame table.
394f90d to
138bd93
Compare
|
I got rid of the lazy initialization and couldn't resist trying to allow unloading. |
|
To allow unloading I needed to store the return addresses directly in the table and to add a check in the insertion. The check is probably almost free, but this increase the size of the table. Note that it also means that collecting the local gc roots may be a bit cheaper. |
|
No, we don't want to double the size of this hashtable. Think of the caches... I'm not sure why you think you need to put the retaddr in the hash table. Deletion in a linearly-hashed table is painful but can be done without extra info, see e.g. TAOCP vol 3 section 6.4 algorithm R. |
|
It's bigger, but that also means that we don't need to follow some pointers when we walk the stack. That's probably avoid trashing parts of the cache with some useless frame descriptors. Out of curiosity, I benchmark it a bit and it seems to be consistently slightly faster (less than 5%) on exception or gc heavy workload. But I don't really propose that part for inclusion, only the first part. |
|
It's not just the caches: in the case that @milanst investigated, the table's size was 16M and most Jane Street programs are probably at this order of magnitude. You don't want to double that if you can avoid it. |
|
Removal was not tested at all. |
|
What is the status of this patch? My understanding is that @xavierleroy 's request for unloading frametables is implemented, and that the table is not doubled anymore. Would anyone be willing to review the resulting patch? |
asmrun/roots.c
Outdated
|
Merged in trunk, thanks for the patch! |
|
One of our customers has experienced segfaults when dynlinking a lot of code, and my colleague Roven Gabriel has tracked this down to this PR as the probable cause for the segfault. Looking at the commits, two things seem fishy to me:
I've submitted #279 as an attempt to fix these issues. |
|
@alainfrisch I will try to take some time reread it seriously and verify your patch soon. |
|
I confirm this patch is the cause of some segfaults; we've seen this on Coq with flambda at high inlining levels. |
* Fix leaked file descriptor when querying package documentation * Enable info logs in production
It now only reallocate and reinitialise the whole table only when it
is too small. This avoids quadratic behavior when loading a lot of
module with dynlink.
This in problematic on frama-c when inlining increase the code size. The frame table
initialisation took ~0.5 second. This is quite noticeable on real examples where the
whole frama-c analysis is ~1.5s long.