
Reimplement output_value using a hash table to detect sharing#9353

Merged
xavierleroy merged 1 commit into ocaml:trunk from xavierleroy:output-value-hash on Apr 16, 2020

Conversation

@xavierleroy (Contributor)

Marshaling (output_value and related stdlib functions) needs to detect and preserve sharing in the OCaml value being serialized (unless the Marshal.No_sharing flag is given).

Until 2004 this was done by recording the addresses and output positions of already-seen blocks in a separate hash table.

In 2004, commit 1945d78, I wrote a more efficient implementation that modifies in-place the OCaml value to record already-seen blocks, then undoes the modifications at exit.

Now it's 2020, Multicore OCaml knocks at the door, and wants its hash table back, because those in-place modifications are incompatible with a concurrent GC.

This PR reimplements the pre-2004 solution, with a twist to the hash table implementation (using a bit vector to determine more quickly whether a hash table entry is empty) that improves performance a bit.
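The bit-vector twist can be sketched as follows. This is an illustrative reconstruction, not the code from runtime/extern.c: all names (`pos_table`, `lookup_or_insert`, `make_table`, the hash constant) are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* One table entry: the address of an already-seen block and the
   output position at which it was serialized. */
struct entry { uintptr_t obj; uintptr_t pos; };

struct pos_table {
  size_t mask;          /* size - 1; size is a power of 2 */
  uintptr_t *present;   /* bit vector: one "slot is filled" bit per entry */
  struct entry *entries;
};

#define BITS_PER_WORD (8 * sizeof(uintptr_t))

int slot_present(struct pos_table *t, size_t h) {
  return (t->present[h / BITS_PER_WORD] >> (h % BITS_PER_WORD)) & 1;
}

void slot_mark(struct pos_table *t, size_t h) {
  t->present[h / BITS_PER_WORD] |= (uintptr_t)1 << (h % BITS_PER_WORD);
}

struct pos_table *make_table(size_t size) {  /* size must be a power of 2 */
  struct pos_table *t = malloc(sizeof *t);
  t->mask = size - 1;
  t->present = calloc((size + BITS_PER_WORD - 1) / BITS_PER_WORD,
                      sizeof(uintptr_t));
  t->entries = calloc(size, sizeof(struct entry));
  return t;
}

/* If obj is in the table, return its recorded position; otherwise
   insert it with position pos and return -1.  Emptiness is tested
   against the densely packed bit vector, so a miss usually touches
   one hot cache line instead of a cold entry of the entries array. */
intptr_t lookup_or_insert(struct pos_table *t, uintptr_t obj, uintptr_t pos) {
  size_t h = (size_t)(obj * (uintptr_t)0x9E3779B97F4A7C15ULL) & t->mask;
  while (slot_present(t, h)) {
    if (t->entries[h].obj == obj) return (intptr_t)t->entries[h].pos;
    h = (h + 1) & t->mask;   /* linear probing */
  }
  slot_mark(t, h);
  t->entries[h].obj = obj;
  t->entries[h].pos = pos;
  return -1;
}
```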

Performance is comparable with that of the current implementation for small values, but degrades sharply (3-4 times slower) for large values, when the hash table becomes large enough that nearly every access misses in all the caches. (A single load instruction, corresponding to the first probe in the table, can account for 30% of the total running time of marshaling!)

Here are some figures: Marshal.to_bytes of a perfect binary tree of depth N, with no internal sharing:

 N  Old impl    New impl    Ratio
 2: 1.05e-07s   1.16e-07s   1.10 
 4: 2.91e-07s   3.17e-07s   1.09 
 8: 4.21e-06s   5.87e-06s   1.40 
12: 7.07e-05s   1.13e-04s   1.60 
16: 1.27e-03s   3.40e-03s   2.68 
20: 3.29e-02s   1.13e-01s   3.43 
24: 7.07e-01s   2.41e+00s   3.40 

The slowdown is not due to excessive collisions in the hash table: on this test, hashing is nearly perfect, > 99% of hash table accesses hit on the first probe, 100% on the first or second probe.

Even though output_value becomes slower, it remains quite efficient overall. In particular, there was no observable slowdown on make world.opt, despite ocamlc and ocamlopt using output_value to produce .cmi and .cmx files.

@gasche (Member) left a comment

I did not do a full review, but I have a couple of comments on the code, which mixes abstraction levels in the serialization code (I'm not sure whether this is important or just a stylistic choice).

The numbers look promising, in the sense that they look "good enough" to tolerate the slowdown if it is required for Multicore OCaml.

(For the performance-obsessed: when marshaling larger amounts of data, I would assume that many of the objects are consecutive in memory. I wonder if there are performance gains to be made from this assumption. I thought about storing intervals instead of points, but that makes lookup costly; and about recognizing when a page is full of marshaled objects, but it's not clear how to do that reliably.)

runtime/extern.c (outdated diff)
}
goto next_item;
}
h = (h + 1) & pos_table.mask;
@gasche (Member)

I find it a bit strange to have internal details of the table implementation (linear probing here, and the bit-vector shortcut above) exposed in the calling code. Is there a good reason not to turn this into a lookup function (possibly marked inline for performance)?

@xavierleroy (Contributor, Author)

The lookup function has a not-very-nice type when expressed in C, because morally it returns a sum type: Found of position-in-the-output | Notfound of hash-code-for-insertion. I pushed a tentative implementation along these lines. GCC inlines the function and generates slightly worse code than in the original PR, but it doesn't make a significant performance difference.
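One way to give that "sum type" a C shape is a struct with a discriminant. This is a hypothetical sketch (toy fixed-size table; all names invented for illustration, the merged code differs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TBL_SIZE 16            /* toy fixed-size table for illustration */
#define TBL_MASK (TBL_SIZE - 1)

struct entry { uintptr_t obj; uintptr_t pos; };
struct entry table[TBL_SIZE];  /* obj == 0 marks an empty slot */

/* Morally: Found of position | Notfound of hash-code-for-insertion,
   encoded as a discriminated struct. */
struct lookup_result {
  bool found;
  uintptr_t pos;   /* output position; valid when found */
  size_t h;        /* free slot to insert into; valid when !found */
};

struct lookup_result lookup(uintptr_t obj) {
  size_t h = (size_t)(obj * 0x9E3779B9u) & TBL_MASK;
  while (table[h].obj != 0) {
    if (table[h].obj == obj)
      return (struct lookup_result){ true, table[h].pos, h };
    h = (h + 1) & TBL_MASK;    /* linear probing */
  }
  return (struct lookup_result){ false, 0, h };
}

void insert_at(size_t h, uintptr_t obj, uintptr_t pos) {
  table[h].obj = obj;
  table[h].pos = pos;
}
```

The caller branches on `found`: on a hit it emits a back-reference to `pos`; on a miss it serializes the block and records it with `insert_at` using the returned slot `h`.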


/* Grow the table quickly (x 8) up to 10^6 entries,
more slowly (x 2) afterwards. */
if (old.size < 1000000) {
@gasche (Member)

This could be a candidate for a macro-defined constant (instead of being hardcoded), if only for code hygiene.

@xavierleroy (Contributor, Author)

I'll pass: the hard-coded constant is explained in the comment just above, and I like that better than a standalone #define.
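For comparison, the macro variant gasche suggests might look like this (the constant name is invented; as noted above, the merged code keeps the literal with the explanatory comment instead):

```c
#include <stddef.h>

/* Threshold below which the table grows aggressively (x8); above it,
   growth falls back to doubling.  FAST_GROWTH_LIMIT is a made-up name. */
#define FAST_GROWTH_LIMIT 1000000

size_t next_table_size(size_t old_size) {
  return old_size < FAST_GROWTH_LIMIT ? old_size * 8 : old_size * 2;
}
```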

pos_table.entries[h].obj = obj;
pos_table.entries[h].pos = obj_counter;
obj_counter++;
if (obj_counter >= pos_table.threshold) extern_resize_position_table();
@gasche (Member)

Again here, I wonder if the insertion logic could remain in the library code rather than being inlined in the user code.

@xavierleroy (Contributor, Author)

Not sure what you mean. The obj_counter global does play two roles: the occupancy of the table and the current output position. But it's not worth having two distinct variables for these two purposes and incrementing them both.

@bobot (Contributor)

bobot commented Mar 8, 2020

with a twist to the hash table implementation (using a bit vector to determine more quickly whether a hash table entry is empty) that improves performance a bit.

Could a Bloom filter improve performance even more for a big hash table, by reducing the size of the bit vector?

@gasche (Member)

gasche commented Mar 8, 2020

My bet would be "no": a Bloom filter has slower lookup than a bit vector; it would reduce memory usage, but this is not the most memory-constrained part of the code. (Note: the bit vector tells which hash slots are filled, which is not the same as saying which keys are already in the table.) If you use a Bloom filter, you also have to deal with false positives.

@shakthimaan (Contributor)

I tested this PR's changes by marshaling a binary tree of height 23: with this bit-vector implementation it is 0.5 seconds faster (compared to PR#9293). If you can add the multicore-prerequisite label to this PR, that will help us keep track of it. Thanks for taking the time to create the PR; we appreciate it a lot!

@xavierleroy (Contributor, Author)

Coq, Frama-C and the TIS analyzer are heavy users of marshaling over large data structures, so I'm mentioning @ejgallego, @pascal-cuoq, and @bobot for their information.

@xavierleroy (Contributor, Author)

Re: Bloom filter, it could speed up the detection of objects that are not yet in the table (because of better locality), but then we need to insert the object into the big table, and for that we need to read from it... So it would be slower overall than just reading from the big table directly.

@ejgallego

ejgallego commented Mar 9, 2020

Coq, Frama-C and the TIS analyzer are heavy users of marshaling over large data structures, so I'm mentioning @ejgallego, @pascal-cuoq, and @bobot for their information.

Thank you @xavierleroy; indeed, Coq is very sensitive to marshaling performance. I will launch a bench ASAP [do the opam folks still generate switches for OCaml PRs?]

Also this does likely affect js_of_ocaml too (cc: @hhugo )

@dbuenzli (Contributor)

dbuenzli commented Mar 9, 2020

[do OPAM folks still generate switches for OCaml PRs?]

I don't think so but with opam v2 you can just pin the ocaml-variants package. Something like this should do:

opam switch create ocaml-ov-patch --empty 
opam pin add ocaml-variants.4.11.0+ov-patch git+https://github.com/xavierleroy/ocaml#output-value-hash

P.S. Can someone fix HACKING.adoc? My recent patch looks garbled as far as these instructions are concerned.

@gasche (Member)

gasche commented Mar 9, 2020

P.S. Can someone fix HACKING.adoc? My recent patch looks garbled as far as these instructions are concerned.

Done.

@ejgallego

ejgallego commented Mar 9, 2020

I don't think so but with opam v2 you can just pin the ocaml-variants package.

Thanks @dbuenzli; unfortunately our bench infrastructure needs the ocaml-variants package to be in a repository, but I've just updated the repository myself.

Coq seems to build fine with the PR, bench is running here: https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/880/console

@gasche (Member)

gasche commented Mar 10, 2020

The bench results (OLD is before the PR, and NEW is after the PR) for Coq show no worrying performance regression; excluding a synthetic/artificial test, the biggest reported slowdown is +1.30% (coq-fiat-core). Oddly, one long-running package (coq-mathcomp-odd-order) is reportedly faster (-2.15%). I don't understand whether the measured speedup is a noise artefact; I don't see how the PR could make any program faster. (Well, programs using fork may benefit from writing to a separate location during serialization instead of trashing shared memory; but I don't think this package qualifies.)

On the other hand, there are clear memory-usage increases with this patch, with several packages in the +2%-+5% range. The worst case is coq-color at +7%, going from 1,150MB to 1,230MB. But then again, coq-fiat-parsers reports a -2% reduction in memory usage, so some of this may be just measurement noise.

To summarize: no worrying runtime regression, but possibly a small observable memory-usage increase.

@ejgallego

Indeed, perf numbers seem fine @gasche. I'd say the speedup in odd-order is real though: 2.15% is significantly above the usual noise [±0.5%].

We can always repeat the bench for this development only, but I'd bet the speedup will remain. Performance profiles are indeed tricky, who knows what is going on.

@pascal-cuoq

pascal-cuoq commented Mar 10, 2020

Thanks for the heads-up @xavierleroy.

TrustInSoft Analyzer has two options, -save file and -load file, that respectively use marshaling and unmarshaling. The marshaled and unmarshaled value can occupy most of the available RAM, which is typically 60GiB. Marshaling is done with OCaml's provided function, one of the functions that this PR modifies. Unmarshaling is done with a custom function that provides special services, as documented in Lightweight Typed Customizable Unmarshaling, because unmarshaled values must be re-maximally-shared on the fly, ideally with as low a peak memory use as possible(*).

From the point of view of TrustInSoft Analyzer, the time it takes to marshal and unmarshal would ideally be as low as possible, yes. It's the speed of unmarshaling that matters most: marshaling usually comes after an analysis that has taken much longer to create the value to save in the first place than to write it to disk. The hard constraint is the memory use peak during marshaling.

We will run some experiments in the coming week.

(*) this is not only about preserving sharing during unmarshaling. Values that exist in memory after unmarshaling must be registered in their respective hashconsing tables, which are not saved because they contain no information. Also, values that already existed in memory and in the hashconsing tables before unmarshaling must have stayed where they were, and unmarshaled values that were identical to one of these must have been shared with it. This could all be done, using OCaml's unmarshal function, in two passes, but the point of the custom unmarshal function is to avoid the memory peak where the re-sharing is almost done, the re-shared copies have almost all been created, hash-tables to help with the process are full, and the value which was obtained from disk has not been freed yet.

@xavierleroy (Contributor, Author)

@pascal-cuoq we'll wait for your experiments. Marshaling time will probably increase, but memory could also be an issue. In the current implementation, the trail uses two words per unique object encountered. In the hash-table-based implementation, the hash table uses two words per entry, but its occupancy is kept between 1/3 and 2/3, hence the hash table can be up to 3 times as large as the trail.
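The arithmetic behind that estimate, as a sketch (the function names are invented, and the figures restate only what the comment above says):

```c
#include <stddef.h>

/* Old implementation: the trail uses 2 words per unique object
   encountered, as stated above. */
size_t trail_words(size_t objects) { return 2 * objects; }

/* New implementation: 2 words per hash-table entry, with occupancy
   kept between 1/3 and 2/3, so in the worst case the table holds
   3 slots per object actually stored. */
size_t hashtbl_words_worst(size_t objects) { return 2 * objects * 3; }
```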

@gasche (Member)

gasche commented Mar 10, 2020

Regarding the small performance improvement with odd-order: in fact this implementation change may improve performance in the case of objects with high sharing (as hash-consed Coq terms may be). To know whether a word is already serialized, we no longer hit the main program memory (where the object is) but a smaller table. This may improve locality in cases where the object is spread/fragmented in memory and lookups typically fall outside the cache, while the hash table remains in the cache.

@hhugo (Contributor)

hhugo commented Mar 10, 2020

Also this does likely affect js_of_ocaml too (cc: @hhugo )

jsoo does not marshal any value

@pascal-cuoq

@xavierleroy the compilation of some of TrustInSoft Analyzer's dependencies with the most bleeding-edge OCaml is turning out to be more complicated than anticipated. By Monday evening CET I will have results, or be able to guarantee a short additional delay for results, or be able to tell that we cannot provide feedback on this branch in a timely fashion.

@xavierleroy (Contributor, Author)

@pascal-cuoq No problem, take the time it takes to evaluate this PR.

@gasche (Member)

gasche commented Mar 14, 2020

I think that it shouldn't be too hard to backport the change to an older OCaml release. @pascal-cuoq what is the latest released version that you can build with?

@pascal-cuoq

Unfortunately, we missed our chance to put together the analyzer compiled with the marshal branch and the benchmark before the lockdown, and they are now a few meters apart on two separate computers that we are unable to access during the lockdown. We know that we managed to build enough of the analyzer for benchmarking just before the lockdown. It is difficult to tell how much more time it will take to rebuild it from scratch, especially since the lockdown has also created other work of its own.

I can only recommend that you go ahead, while we'll measure the consequences of the change asynchronously. We can always switch to a marshaling system that we have complete control of in case it turns out later that the new one is not acceptable for TrustInSoft Analyzer, and even this seems unlikely.

@xavierleroy force-pushed the output-value-hash branch 2 times, most recently from ceac569 to 71c5b63 on April 8, 2020 10:30
@xavierleroy (Contributor, Author)

This PR is ready, I think, but I don't know what the release plans are. Should this be in 4.11 (despite the slowdown) or wait for later?

@gasche (Member)

gasche commented Apr 9, 2020

If the PR is ready and we have decided to include this, then I think we should merge into trunk right away; I don't see a good reason to wait on a PR that is ready.

(Right now there apparently is a conflict in the Changes file.)

@xavierleroy (Contributor, Author)

I don't see a good reason to wait for a PR that is ready.

One reason: because it degrades performance, and is not required until more of Multicore OCaml is merged.

@gasche (Member)

gasche commented Apr 9, 2020

On the other hand we won't really know by how much it degrades performance until our users try using it seriously. If it turns out we cannot afford the performance degradation, we would rather learn it sooner rather than later so that there are more options and less pressure.

If we are worried about the performance degradation, we can make this an OCAMLRUNPARAM-tunable option so that affected users have a way out. (But this opens a Pandora's box of duplicating parts of the runtime between a sequential version and a parallel-friendly version.)

@avsm added this to the 4.11 milestone on Apr 16, 2020
@avsm (Member)

avsm commented Apr 16, 2020

Developer meeting roundup: consensus is to merge for 4.11. @xavierleroy to do so soon.

The previous implementation was doing temporary in-place modifications
of the OCaml value being marshaled.  This is incompatible with
Multicore OCaml.
@xavierleroy merged commit 67ada54 into ocaml:trunk on Apr 16, 2020
@xavierleroy deleted the output-value-hash branch on April 16, 2020 15:56
@mbouaziz (Contributor)

Tagging @jvillard @jberdine @ngorogiannis for Infer, heavy user of marshalling too.
Hack/Flow/Zoncolan/Pyre/Pysa may be affected too.
