
Reimplement output_value using a hash table to detect sharing#9353

Merged
xavierleroy merged 1 commit into ocaml:trunk from xavierleroy:output-value-hash on Apr 16, 2020

Conversation

@xavierleroy (Contributor)

Marshaling (output_value and related stdlib functions) needs to detect and preserve sharing in the OCaml value being serialized (unless the Marshal.No_sharing flag is given).

Until 2004 this was done by recording the addresses and output positions of already-seen blocks in a separate hash table.

In 2004, commit 1945d78, I wrote a more efficient implementation that modifies in-place the OCaml value to record already-seen blocks, then undoes the modifications at exit.

Now it's 2020, Multicore OCaml knocks at the door, and wants its hash table back, because those in-place modifications are incompatible with a concurrent GC.

This PR reimplements the pre-2004 solution, with a twist to the hash table implementation (using a bit vector to determine more quickly whether a hash table entry is empty) that improves performance a bit.
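The bit-vector twist can be sketched as follows. This is an illustrative reconstruction, not the code from runtime/extern.c: all names (`pos_table`, `lookup_or_insert`, `make_table`, the hash constant) are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* One table entry: the address of an already-seen block and the
   output position at which it was serialized. */
struct entry { uintptr_t obj; uintptr_t pos; };

struct pos_table {
  size_t mask;          /* size - 1; size is a power of 2 */
  uintptr_t *present;   /* bit vector: one "slot is filled" bit per entry */
  struct entry *entries;
};

#define BITS_PER_WORD (8 * sizeof(uintptr_t))

int slot_present(struct pos_table *t, size_t h) {
  return (t->present[h / BITS_PER_WORD] >> (h % BITS_PER_WORD)) & 1;
}

void slot_mark(struct pos_table *t, size_t h) {
  t->present[h / BITS_PER_WORD] |= (uintptr_t)1 << (h % BITS_PER_WORD);
}

struct pos_table *make_table(size_t size) {  /* size must be a power of 2 */
  struct pos_table *t = malloc(sizeof *t);
  t->mask = size - 1;
  t->present = calloc((size + BITS_PER_WORD - 1) / BITS_PER_WORD,
                      sizeof(uintptr_t));
  t->entries = calloc(size, sizeof(struct entry));
  return t;
}

/* If obj is in the table, return its recorded position; otherwise
   insert it with position pos and return -1.  Emptiness is tested
   against the densely packed bit vector, so a miss usually touches
   one hot cache line instead of a cold entry of the entries array. */
intptr_t lookup_or_insert(struct pos_table *t, uintptr_t obj, uintptr_t pos) {
  size_t h = (size_t)(obj * (uintptr_t)0x9E3779B97F4A7C15ULL) & t->mask;
  while (slot_present(t, h)) {
    if (t->entries[h].obj == obj) return (intptr_t)t->entries[h].pos;
    h = (h + 1) & t->mask;   /* linear probing */
  }
  slot_mark(t, h);
  t->entries[h].obj = obj;
  t->entries[h].pos = pos;
  return -1;
}
```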

Performance is comparable with that of the current implementation for small values, but degrades sharply (3-4 times slower) for large values, when the hash table becomes large enough that nearly every access misses in all the caches. (A single load instruction, corresponding to the first probe in the table, can account for 30% of the total running time of marshaling!)

Here are some figures: Marshal.to_bytes of a perfect binary tree of depth N, with no internal sharing:

 N  Old impl    New impl    Ratio
 2: 1.05e-07s   1.16e-07s   1.10 
 4: 2.91e-07s   3.17e-07s   1.09 
 8: 4.21e-06s   5.87e-06s   1.40 
12: 7.07e-05s   1.13e-04s   1.60 
16: 1.27e-03s   3.40e-03s   2.68 
20: 3.29e-02s   1.13e-01s   3.43 
24: 7.07e-01s   2.41e+00s   3.40 

The slowdown is not due to excessive collisions in the hash table: on this test, hashing is nearly perfect, > 99% of hash table accesses hit on the first probe, 100% on the first or second probe.

Even though output_value becomes slower, it remains quite efficient overall. In particular, there was no observable slowdown on make world.opt, despite ocamlc and ocamlopt using output_value to produce .cmi and .cmx files.

@gasche (Member) left a comment

I did not do a full review, but I have a couple of comments on the code, which mixes abstraction levels in the serialization code (I'm not sure whether this is important or just a stylistic choice).

The numbers look promising, in the sense that they look "good enough" to tolerate the slowdown if it is required for Multicore OCaml.

(For the performance-obsessed: when marshaling larger amounts of data, I would assume that many of the objects are consecutive in memory. I wonder if there are performance gains to be made from this assumption. I thought about storing intervals instead of points, but that makes lookup costly; and about recognizing when a page is full of marshaled objects, but it's not clear how to do that reliably.)

runtime/extern.c (outdated diff)
}
goto next_item;
}
h = (h + 1) & pos_table.mask;
@gasche (Member)

I find it a bit strange to have internal details of the table implementation (linear probing here, and the bit-vector shortcut above) exposed in the calling code. Is there a good reason not to turn this into a lookup function (possibly marked inline for performance)?

@xavierleroy (Contributor, Author)

The lookup function has a not-very-nice type when expressed in C, because morally it returns a sum type: Found of position-in-the-output | Notfound of hash-code-for-insertion. I pushed a tentative implementation along these lines. GCC inlines the function and generates slightly worse code than in the original PR, but it doesn't make a significant performance difference.
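One way to give that "sum type" a C shape is a struct with a discriminant. This is a hypothetical sketch (toy fixed-size table; all names invented for illustration, the merged code differs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TBL_SIZE 16            /* toy fixed-size table for illustration */
#define TBL_MASK (TBL_SIZE - 1)

struct entry { uintptr_t obj; uintptr_t pos; };
struct entry table[TBL_SIZE];  /* obj == 0 marks an empty slot */

/* Morally: Found of position | Notfound of hash-code-for-insertion,
   encoded as a discriminated struct. */
struct lookup_result {
  bool found;
  uintptr_t pos;   /* output position; valid when found */
  size_t h;        /* free slot to insert into; valid when !found */
};

struct lookup_result lookup(uintptr_t obj) {
  size_t h = (size_t)(obj * 0x9E3779B9u) & TBL_MASK;
  while (table[h].obj != 0) {
    if (table[h].obj == obj)
      return (struct lookup_result){ true, table[h].pos, h };
    h = (h + 1) & TBL_MASK;    /* linear probing */
  }
  return (struct lookup_result){ false, 0, h };
}

void insert_at(size_t h, uintptr_t obj, uintptr_t pos) {
  table[h].obj = obj;
  table[h].pos = pos;
}
```

The caller branches on `found`: on a hit it emits a back-reference to `pos`; on a miss it serializes the block and records it with `insert_at` using the returned slot `h`.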


/* Grow the table quickly (x 8) up to 10^6 entries,
more slowly (x 2) afterwards. */
if (old.size < 1000000) {
@gasche (Member)

This could be a candidate for a macro-defined constant (instead of being hardcoded), if only for code hygiene.

@xavierleroy (Contributor, Author)

I'll pass: the hard-coded constant is explained in the comment just above, and I like that better than a standalone #define.
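For comparison, the macro variant gasche suggests might look like this (the constant name is invented; as noted above, the merged code keeps the literal with the explanatory comment instead):

```c
#include <stddef.h>

/* Threshold below which the table grows aggressively (x8); above it,
   growth falls back to doubling.  FAST_GROWTH_LIMIT is a made-up name. */
#define FAST_GROWTH_LIMIT 1000000

size_t next_table_size(size_t old_size) {
  return old_size < FAST_GROWTH_LIMIT ? old_size * 8 : old_size * 2;
}
```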

pos_table.entries[h].obj = obj;
pos_table.entries[h].pos = obj_counter;
obj_counter++;
if (obj_counter >= pos_table.threshold) extern_resize_position_table();
@gasche (Member)

Again here, I wonder if the insertion logic could remain in the library code rather than being inlined in the user code.

@xavierleroy (Contributor, Author)

Not sure what you mean. The obj_counter global does play two roles: the occupancy of the table and the current output position. But it's not worth having two distinct variables for these two purposes and incrementing them both.

@bobot (Contributor)

bobot commented Mar 8, 2020

with a twist to the hash table implementation (using a bit vector to determine more quickly whether a hash table entry is empty) that improves performance a bit.

Could a Bloom filter improve performance even more for a big hash table, by reducing the size of the bit vector?

@gasche (Member)

gasche commented Mar 8, 2020

My bet would be "no": a Bloom filter has slower lookup than a bit vector; it would reduce memory usage, but this is not the most memory-constrained part of the code. (Note: the bit vector tells which hash slots are filled, which is not the same as saying which keys are already in the table.) If you use a Bloom filter, you also have to deal with false positives.

@shakthimaan (Contributor)

I tested this PR's changes by marshaling a binary tree of height 23: with this bit-vector implementation it is 0.5 seconds faster (compared to PR#9293). If you can add the multicore-prerequisite label to this PR, that will help us keep track of it. Thanks for taking the time to create the PR; we appreciate it a lot!

@xavierleroy (Contributor, Author)

Coq, Frama-C and the TIS analyzer are heavy users of marshaling over large data structures, so I'm mentioning @ejgallego, @pascal-cuoq, and @bobot for their information.

@xavierleroy (Contributor, Author)

Re: Bloom filter, it could speed up the detection of objects that are not yet in the table (because of better locality), but then we need to insert the object into the big table, and for that we need to read from it... So it would be slower overall than just reading from the big table directly.

@ejgallego

ejgallego commented Mar 9, 2020

Coq, Frama-C and the TIS analyzer are heavy users of marshaling over large data structures, so I'm mentioning @ejgallego, @pascal-cuoq, and @bobot for their information.

Thank you @xavierleroy; indeed, Coq is very sensitive to marshaling performance. I will launch a bench ASAP [do the opam folks still generate switches for OCaml PRs?]

Also this does likely affect js_of_ocaml too (cc: @hhugo )

@dbuenzli (Contributor)

dbuenzli commented Mar 9, 2020

[do OPAM folks still generate switches for OCaml PRs?]

I don't think so but with opam v2 you can just pin the ocaml-variants package. Something like this should do:

opam switch create ocaml-ov-patch --empty 
opam pin add ocaml-variants.4.11.0+ov-patch git+https://github.com/xavierleroy/ocaml#output-value-hash

P.S. Can someone fix HACKING.adoc? My recent patch looks garbled as far as these instructions are concerned.

@gasche (Member)

gasche commented Mar 9, 2020

P.S. Can someone fix HACKING.adoc? My recent patch looks garbled as far as these instructions are concerned.

Done.

@ejgallego

ejgallego commented Mar 9, 2020

I don't think so but with opam v2 you can just pin the ocaml-variants package.

Thanks @dbuenzli; unfortunately our bench infrastructure needs the ocaml-variants package to be in a repository, but I've just updated the repository myself.

Coq seems to build fine with the PR, bench is running here: https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/880/console

@gasche (Member)

gasche commented Mar 10, 2020

The bench results (OLD is before the PR, and NEW is after the PR) for Coq show no worrying performance regression; excluding a synthetic/artificial test, the biggest reported slowdown is +1.30% (coq-fiat-core). Oddly, one long-running package (coq-mathcomp-odd-order) is reportedly faster (-2.15%). I don't understand whether the measured speedup is a noise artefact; I don't see how the PR could make any program faster. (Well, programs using fork may benefit from writing to a separate location during serialization instead of trashing shared memory; but I don't think this package qualifies.)

On the other hand, there are clear memory-usage increases with this patch, with several packages in the +2%-+5% range. The worst case is coq-color at +7%, going from 1,150MB to 1,230MB. But then again, coq-fiat-parsers reports a -2% reduction in memory usage, so some of this may be just measurement noise.

To summarize: no worrying runtime regression, but possibly a small observable memory-usage increase.

@ejgallego

Indeed, perf numbers seem fine @gasche. I'd say the speedup in odd-order is real though: 2.15% is significantly above the usual noise [±0.5%].

We can always repeat the bench for this development only, but I'd bet the speedup will remain. Performance profiles are indeed tricky, who knows what is going on.

@pascal-cuoq

pascal-cuoq commented Mar 10, 2020

Thanks for the heads-up @xavierleroy.

TrustInSoft Analyzer has two options, -save file and -load file, that respectively use marshaling and unmarshaling. The marshaled and unmarshaled value can occupy most of the available RAM, which is typically 60GiB. Marshaling is done with OCaml's provided function, one of the functions that this PR modifies. Unmarshaling is done with a custom function that provides special services, as documented in Lightweight Typed Customizable Unmarshaling, because unmarshaled values must be re-maximally-shared on the fly, ideally with as low a peak memory use as possible(*).

From the point of view of TrustInSoft Analyzer, the time it takes to marshal and unmarshal would ideally be as low as possible, yes. It's the speed of unmarshaling that matters most: marshaling usually comes after an analysis that has taken much longer to create the value to save in the first place than to write it to disk. The hard constraint is the memory use peak during marshaling.

We will run some experiments in the coming week.

(*) this is not only about preserving sharing during unmarshaling. Values that exist in memory after unmarshaling must be registered in their respective hashconsing tables, which are not saved because they contain no information. Also, values that already existed in memory and in the hashconsing tables before unmarshaling must have stayed where they were, and unmarshaled values that were identical to one of these must have been shared with it. This could all be done, using OCaml's unmarshal function, in two passes, but the point of the custom unmarshal function is to avoid the memory peak where the re-sharing is almost done, the re-shared copies have almost all been created, hash-tables to help with the process are full, and the value which was obtained from disk has not been freed yet.

@xavierleroy (Contributor, Author)

@pascal-cuoq we'll wait for your experiments. Marshaling time will probably increase, but memory could also be an issue. In the current implementation, the trail uses two words per unique object encountered. In the hash-table-based implementation, the hash table uses two words per entry, but its occupancy is kept between 1/3 and 2/3, hence the hash table can be up to 3 times as large as the trail.
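The arithmetic behind that estimate, as a sketch (the function names are invented, and the figures restate only what the comment above says):

```c
#include <stddef.h>

/* Old implementation: the trail uses 2 words per unique object
   encountered, as stated above. */
size_t trail_words(size_t objects) { return 2 * objects; }

/* New implementation: 2 words per hash-table entry, with occupancy
   kept between 1/3 and 2/3, so in the worst case the table holds
   3 slots per object actually stored. */
size_t hashtbl_words_worst(size_t objects) { return 2 * objects * 3; }
```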

@gasche (Member)

gasche commented Mar 10, 2020

Regarding the small performance improvement with odd-order: in fact this implementation change may improve performance in the case of objects with high sharing (as hash-consed Coq terms may be). To know whether a word is already serialized, we no longer hit the main program memory (where the object is) but a smaller table. This may improve locality in cases where the object is spread/fragmented in memory and lookups typically fall outside the cache, while the hash table remains in the cache.

@hhugo (Contributor)

hhugo commented Mar 10, 2020

Also this does likely affect js_of_ocaml too (cc: @hhugo )

jsoo does not marshal any value

@pascal-cuoq

@xavierleroy the compilation of some of TrustInSoft Analyzer's dependencies with the most bleeding-edge OCaml is turning out to be more complicated than anticipated. By Monday evening CET I will have results, or be able to guarantee a short additional delay for results, or be able to tell that we cannot provide feedback on this branch in a timely fashion.

@xavierleroy (Contributor, Author)

@pascal-cuoq No problem, take the time it takes to evaluate this PR.

@gasche (Member)

gasche commented Mar 14, 2020

I think that it shouldn't be too hard to backport the change to an older OCaml release. @pascal-cuoq what is the latest released version that you can build with?

@pascal-cuoq

Unfortunately, we missed our chance to put together the analyzer compiled with the marshal branch and the benchmark before the lockdown, and they are now a few meters apart on two separate computers that we are unable to access during the lockdown. We know that we managed to build enough of the analyzer for benchmarking just before the lockdown. It is difficult to tell how much more time it will take to rebuild it from scratch, especially since the lockdown has also created other work of its own.

I can only recommend that you go ahead, while we'll measure the consequences of the change asynchronously. We can always switch to a marshaling system that we have complete control of in case it turns out later that the new one is not acceptable for TrustInSoft Analyzer, and even this seems unlikely.

@xavierleroy force-pushed the output-value-hash branch 2 times, most recently from ceac569 to 71c5b63 on April 8, 2020 10:30
@xavierleroy (Contributor, Author)

This PR is ready, I think, but I don't know what the release plans are. Should this be in 4.11 (despite the slowdown) or wait for later?

@gasche (Member)

gasche commented Apr 9, 2020

If the PR is ready and we have decided to include this, then I think we should merge into trunk right away; I don't see a good reason to wait on a PR that is ready.

(Right now there apparently is a conflict in the Changes file.)

@xavierleroy (Contributor, Author)

I don't see a good reason to wait for a PR that is ready.

One reason: because it degrades performance, and is not required until more of Multicore OCaml is merged.

@gasche (Member)

gasche commented Apr 9, 2020

On the other hand we won't really know by how much it degrades performance until our users try using it seriously. If it turns out we cannot afford the performance degradation, we would rather learn it sooner rather than later so that there are more options and less pressure.

If we are worried about the performance degradation, we can make this an OCAMLRUNPARAM-tunable option so that affected users have a way out. (But this opens a Pandora's box of duplicating parts of the runtime between a sequential version and a parallel-friendly version.)

@avsm added this to the 4.11 milestone on Apr 16, 2020
@avsm (Member)

avsm commented Apr 16, 2020

Developer meeting roundup: consensus is to merge for 4.11. @xavierleroy to do so soon.

The previous implementation was doing temporary in-place modifications
of the OCaml value being marshaled.  This is incompatible with
Multicore OCaml.
@xavierleroy merged commit 67ada54 into ocaml:trunk on Apr 16, 2020
@xavierleroy deleted the output-value-hash branch on April 16, 2020 15:56
@mbouaziz (Contributor)

Tagging @jvillard @jberdine @ngorogiannis for Infer, heavy user of marshalling too.
Hack/Flow/Zoncolan/Pyre/Pysa may be affected too.
