Reimplement output_value using a hash table to detect sharing #9353
xavierleroy merged 1 commit into ocaml:trunk
Conversation
gasche left a comment:
I did not do a full review but I have a couple comments on the code, which is mixing abstraction levels in the serialization code (I'm not sure if this is important or just a stylistic choice).
The numbers look promising, in the sense that they look "good enough" to tolerate the slowdown if it is required for Multicore OCaml.
(For the performance-obsessed: when marshalling larger amount of data, I would assume that many of the objects are consecutive to each other. I wonder if there are performance gains to make from this assumption. I thought about storing intervals instead of points, but this makes lookup costly, and about recognizing when a page is full of marshalled objects, but it's not clear how to do this reliably.)
runtime/extern.c (outdated diff):

      }
      goto next_item;
    }
    h = (h + 1) & pos_table.mask;
I find it a bit strange to have internal details of the table implementation (linear probing here, and the bit-vector short path above) exposed in the user code. Is there a good reason not to turn this into a lookup function (possibly marked inline for performance)?
The lookup function has a not-very-nice type when expressed in C, because morally it returns a sum type, Found of position-in-the-output | Notfound of hash-code-for-insertion. I pushed a tentative implementation along these lines. GCC inlines the function, generates slightly less good code than in the original PR, but it doesn't make a significant performance difference.
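The shape being discussed can be sketched in C roughly as follows. This is an illustrative sketch, not the actual runtime/extern.c code: the names (`postable`, `table_lookup`, `lookup_result`) are hypothetical, and the struct-valued result encodes the "Found of position | Notfound of hash-slot-for-insertion" sum that is awkward to express directly in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry { void *obj; uintptr_t pos; };

struct postable {
  struct entry *entries;
  uintptr_t mask;          /* size - 1; size is a power of 2 */
};

/* Morally: Found of position | Notfound of slot-for-insertion */
struct lookup_result {
  int found;               /* 1 if obj is already recorded */
  uintptr_t pos;           /* output position if found */
  uintptr_t slot;          /* free slot to insert into if not found */
};

/* Linear probing: scan from the initial hash until we find obj or an
   empty slot.  Termination relies on the table never being full,
   which the resizing policy guarantees. */
static struct lookup_result table_lookup(struct postable *t,
                                         void *obj, uintptr_t h)
{
  struct lookup_result r;
  for (;;) {
    h &= t->mask;
    if (t->entries[h].obj == obj) {
      r.found = 1; r.pos = t->entries[h].pos; r.slot = h;
      return r;
    }
    if (t->entries[h].obj == NULL) {
      r.found = 0; r.pos = 0; r.slot = h;
      return r;
    }
    h++;
  }
}
```

With GCC inlining this, the generated code is close to the hand-inlined probing loop, which matches the observation that the refactoring makes no significant performance difference.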
    /* Grow the table quickly (x 8) up to 10^6 entries,
       more slowly (x 2) afterwards. */
    if (old.size < 1000000) {
This could be a candidate for a macro-defined constant (instead of being hardcoded), if only as code hygiene.
I'll pass. The hard-coded constant is explained in the comment just above; I like that better than a standalone #define.
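For illustration only, the growth policy described in that comment reads like this when pulled out into a function (the function name is hypothetical; the PR keeps the logic inline):

```c
#include <stdint.h>

/* Grow the table quickly (x 8) up to 10^6 entries, more slowly (x 2)
   afterwards.  old_size is assumed to be a power of 2, so the result
   stays a power of 2 and the mask trick keeps working. */
static uintptr_t next_table_size(uintptr_t old_size)
{
  return old_size < 1000000 ? old_size * 8 : old_size * 2;
}
```

The fast initial growth keeps the number of costly rehashes low for small-to-medium values, while the x2 regime bounds peak memory for very large ones.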
    pos_table.entries[h].obj = obj;
    pos_table.entries[h].pos = obj_counter;
    obj_counter++;
    if (obj_counter >= pos_table.threshold) extern_resize_position_table();
Again here, I wonder if the insertion logic could remain in the library code rather than being inlined in the user code.
Not sure what you mean. The obj_counter global indeed plays two roles: the occupancy of the table and the current output position. But it's not worth having two distinct variables for these two purposes and incrementing them both.
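A simplified sketch of the insertion step under discussion, with a single counter playing both roles (illustrative names and a toy fixed-size table, not the actual runtime code):

```c
#include <stdint.h>

struct entry { void *obj; uintptr_t pos; };

static struct entry entries[8];
static uintptr_t obj_counter = 0;   /* occupancy AND next output position */
static uintptr_t threshold = 5;     /* roughly 2/3 of the table size */
static int resized = 0;

/* Stand-in for extern_resize_position_table(): would grow and rehash. */
static void resize_position_table(void) { resized = 1; }

/* Record obj in the free slot h found by the lookup; the position
   stored is the number of objects emitted so far, which is exactly
   the table occupancy, hence the shared counter. */
static void record_object(uintptr_t h, void *obj)
{
  entries[h].obj = obj;
  entries[h].pos = obj_counter;
  obj_counter++;
  if (obj_counter >= threshold) resize_position_table();
}
```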
Could a Bloom filter improve performance even further for big hash tables, by reducing the size of the bit vector?
My bet would be "no": a Bloom filter has slower lookup than a bit vector. It would reduce memory usage, but this is not the most memory-constrained part of the code. (Note: the bit vector tells which hash slots are filled, which is not the same as saying which keys are already in the table.) If you use a Bloom filter, you also have to deal with false positives.
I tested this PR's changes by marshaling a binary tree of height 23, and it is 0.5 seconds faster (compared to PR#9293) with this bit vector implementation. If you can add the …
Coq, Frama-C and the TIS analyzer are heavy users of marshaling over large data structures, so I'm mentioning @ejgallego, @pascal-cuoq, and @bobot for their information.
Re: Bloom filter, it could speed up the detection of objects that are not yet in the table (because of better locality), but then we need to insert the object in the big table, and for that we need to read from it... So it would be slower overall than just reading from the big table directly.
Force-pushed 4308b54 to 9449c21 (compare)
Thank you @xavierleroy. Indeed, Coq is very sensitive to marshalling performance; I will launch a bench ASAP. [Do OPAM folks still generate switches for OCaml PRs?] Also, this likely affects js_of_ocaml too (cc: @hhugo).
I don't think so, but with … P.S. Can someone fix …
Done.
Thanks @dbuenzli. Unfortunately our bench infrastructure needs the ocaml-variants to be in a repository, but I've just updated the repository myself. Coq seems to build fine with the PR; the bench is running here: https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/880/console
The bench results for Coq (OLD is before the PR, NEW is after the PR) show no worrying performance regression; excluding a synthetic/artificial test, the biggest reported slowdown is +1.30% (coq-fiat-core). Oddly, one long-running package (coq-mathcomp-odd-order) is reportedly faster (-2.15%). I don't understand whether the measured speedup is a noise artefact; I don't see how the PR could make any program faster. (Well, programs using …)

On the other hand, there are clear memory-usage increases with this patch, with several packages in the +2% to +5% range. The worst case is coq-color at +7%, going from 1,150MB to 1,230MB. But then again, coq-fiat-parsers reports a -2% reduction in memory usage, so some of this may be just measurement noise.

To summarize: no worrying runtime regression, but possibly a small observable memory-usage increase.
Indeed the perf numbers seem fine @gasche. I'd say the speedup in odd-order is real, though: 2.15% is significantly above the usual noise [± 0.5%]. We can always repeat the bench for this development only, but I'd bet the speedup will remain. Performance profiles are indeed tricky; who knows what is going on.
Thanks for the heads-up @xavierleroy. TrustInSoft Analyzer has two options …

From the point of view of TrustInSoft Analyzer, the time it takes to marshal and unmarshal would ideally be as low as possible, yes. It's the speed of unmarshaling that matters most: marshaling usually comes after an analysis that has taken much longer to create the value to save than to write it to disk. The hard constraint is the memory-use peak during marshaling. We will run some experiments in the coming week.

(*) This is not only about preserving sharing during unmarshaling. Values that exist in memory after unmarshaling must be registered in their respective hashconsing tables, which are not saved because they contain no information. Also, values that already existed in memory and in the hashconsing tables before unmarshaling must have stayed where they were, and unmarshaled values that were identical to one of these must have been shared with it. This could all be done, using OCaml's unmarshal function, in two passes, but the point of the custom unmarshal function is to avoid the memory peak where the re-sharing is almost done, the re-shared copies have almost all been created, the hash tables that help with the process are full, and the value that was obtained from disk has not been freed yet.
@pascal-cuoq we'll wait for your experiments. Marshaling time will probably increase, but memory could also be an issue. In the current implementation, the trail uses two words per unique object encountered. In the hash-table-based implementation, the hash table uses two words per entry, but its occupancy is kept between 1/3 and 2/3, hence the hash table can be up to 3 times as large as the trail.
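As a quick sanity check of that sizing argument (a back-of-envelope sketch; the function names and the 64-bit word size are assumptions, not runtime code):

```c
#include <stdint.h>

/* Trail: 2 words per unique object encountered. */
static uint64_t trail_bytes(uint64_t objects)
{
  return objects * 2 * sizeof(uint64_t);
}

/* Hash table: 2 words per slot, and at the worst-case 1/3 occupancy
   there are 3 slots per recorded object. */
static uint64_t table_bytes_worst_case(uint64_t objects)
{
  return objects * 3 * 2 * sizeof(uint64_t);
}
```

So for a value with one million shared blocks, the trail costs about 16MB while the table can peak around 48MB, consistent with the "3 times as large" estimate.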
Regarding the small performance improvement with odd-order: this implementation change may in fact improve performance for objects with high sharing (as hash-consed Coq terms may be). To know whether an object is already serialized, we no longer hit the main program memory (where the object is), but a smaller table. This can increase locality in cases where the object is spread/fragmented in memory and lookups typically fall outside the cache, while the hash table remains in the cache.
jsoo does not marshal any value |
@xavierleroy the compilation of some of TrustInSoft Analyzer's dependencies with the bleeding edgest OCaml is turning out to be more complicated than anticipated. By Monday evening CET I will have results, or be able to guarantee a short additional delay for results, or be able to tell that we cannot provide feedback in a timely fashion on this branch.
@pascal-cuoq No problem, take the time it takes to evaluate this PR.
I think that it shouldn't be too hard to backport the change to an older OCaml release. @pascal-cuoq what is the latest released version that you can build with?
Unfortunately, we missed our chance to put together the analyzer compiled with the marshal branch and the benchmark before the lockdown, and they are now a few meters apart on two separate computers that we are unable to access during the lockdown. We know that we managed to build enough of the analyzer for benchmarking just before the lockdown. It is difficult to tell how much more time it will take to rebuild it from scratch, especially since the lockdown has also created other work of its own. I can only recommend that you go ahead, while we'll measure the consequences of the change asynchronously. We can always switch to a marshaling system that we have complete control of in case it turns out later that the new one is not acceptable for TrustInSoft Analyzer, and even this seems unlikely.
Force-pushed ceac569 to 71c5b63 (compare)
This PR is ready, I think, but I don't know what the release plans are. Should this be in 4.11 (despite the slowdown) or wait for later?
If the PR is ready and we have decided to include it, then I think we should merge into trunk right away; I don't see a good reason to wait on a PR that is ready. (Right now there apparently is a conflict in the Changes file.)
One reason: because it degrades performance, and is not required until more of Multicore OCaml is merged. |
On the other hand, we won't really know by how much it degrades performance until our users try using it seriously. If it turns out we cannot afford the performance degradation, we would rather learn it sooner rather than later, so that there are more options and less pressure. If we are worried about the performance degradation, we can make this an OCAMLRUNPARAM-tunable option so that affected users have a way out. (But this opens a Pandora's box of duplicating parts of the runtime between a sequential and a parallel-friendly version.)
Developer meeting roundup: consensus is to merge for 4.11. @xavierleroy to do so soon. |
The previous implementation was doing temporary in-place modifications of the OCaml value being marshaled. This is incompatible with Multicore OCaml.
Force-pushed 71c5b63 to d3bd02c (compare)
Tagging @jvillard @jberdine @ngorogiannis for Infer, a heavy user of marshalling too.
Marshaling (`output_value` and related stdlib functions) needs to detect and preserve sharing in the OCaml value being serialized (unless the `Marshal.No_sharing` flag is given). Until 2004, this was done by recording the addresses and output positions of already-seen blocks in a separate hash table.
In 2004, commit 1945d78, I wrote a more efficient implementation that modifies in-place the OCaml value to record already-seen blocks, then undoes the modifications at exit.
Now it's 2020, Multicore OCaml knocks at the door, and wants its hash table back, because those in-place modifications are incompatible with a concurrent GC.
This PR reimplements the pre-2004 solution, with a twist to the hash table implementation (using a bit vector to determine more quickly whether a hash table entry is empty) that improves performance a bit.
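The bit-vector twist can be sketched as follows (a hypothetical layout, not the PR's actual code): one bit per hash slot says whether the slot is filled, so probing an empty slot touches 1/128th of the memory that reading the two-word entry would, and the "not yet seen" fast path tends to stay in cache.

```c
#include <stdint.h>

/* One bit per hash slot: bit i of the vector is set iff slot i of the
   position table is occupied.  i >> 6 selects the 64-bit word,
   i & 63 the bit within it. */
#define Bitvect_test(bv, i) (((bv)[(i) >> 6] >> ((i) & 63)) & 1)
#define Bitvect_set(bv, i)  ((bv)[(i) >> 6] |= (uint64_t)1 << ((i) & 63))
```

On a lookup, testing the bit first lets the common miss case (the object has not been seen before) skip the load of the entry itself; only when the bit is set does the probe need to compare the stored pointer.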
Performance is comparable with that of the current implementation for small values, but degrades sharply (3-4 times slower) for large values, when the hash table becomes large enough that nearly every access misses in all the caches. (A single load instruction, corresponding to the first probe in the table, can account for 30% of the total running time of marshaling!)
Here are some figures for `Marshal.to_bytes` of a perfect binary tree of depth N, with no internal sharing (benchmark table omitted).

The slowdown is not due to excessive collisions in the hash table: on this test, hashing is nearly perfect; more than 99% of hash table accesses hit on the first probe, and 100% on the first or second probe.
Even though `output_value` becomes slower, it remains quite efficient overall. In particular, there was no observable slowdown on `make world.opt`, despite ocamlc and ocamlopt using `output_value` to produce .cmi and .cmx files.