Use addrmap hash table for marshaling #9293
Conversation
(I just pushed an adjustment which should allow AppVeyor to complete)
dra27 left a comment
A few quick housekeeping comments on the headers, rather than a full review, I'm afraid.
I find it hard to draw any conclusion from the performance numbers given here. If I understood correctly, the benchmark suite does not test marshalling much (so the fact that the numbers are identical is not very informative). The compiler does rely on marshalling a bit (to create and read compiled interface files, for example). It would be useful to have a microbenchmark to measure marshalling speed regressions, so that people can get a quick estimate of the potential performance regression for their workload if they know the fraction of time spent in marshalling. It may also be interesting to consider adding marshalling examples to the benchmark suite, for example a simple message-passing program (that just passes messages to itself for benchmarking purposes).
opam's state-loading mechanism can be turned fairly quickly into a micro-benchmark (see https://github.com/ocaml/opam/blob/master/src/state/opamRepositoryState.mli#L20). It's at least an important benchmark, as it's used on virtually every opam invocation.
Hack (https://hacklang.org/) and Flow (http://flow.org/) both use the Marshal module extensively to write values into shared memory and read them out. The exact amount of marshaling that we do depends on the source code we're analyzing, but our internal (unfortunately unsharable) workload is quite significant. If it would be helpful, I can look to see what the impact on our workloads is and report back.
@samwgoldman Studying the impact of this PR on internal workloads would be quite useful. Does that mean Hack and Flow can build on trunk? If there's a fork that does build, it would be nice to have them in our benchmark suite.
@kayceesrk Alas, we won't build on trunk at the moment due to, at least, our dependency on ppx_deriving. This is just the first thing I ran into when I tried. I agree it would be nice to have these in your benchmark suite. In the meantime, it seems like it shouldn't be too hard for me to backport this, although perhaps the results would not be as meaningful.
Not much has changed in |
I used the red-black tree example from John Whitington's book, "More OCaml: Algorithms, Methods & Diversions", to create a red-black tree (https://gist.github.com/shakthimaan/2c33be7724c33735c80d9356d1db6837) of height 23 with only integer data, and tested it with Marshal.to_bytes. The executable was created using the following command from the top-level sources directory: The time utility was used to measure the resource usage: With no external_flags argument (
Using
Test system: Parabola GNU/Linux-libre running an Intel(R) Core(TM) i5-6200 CPU @ 2.30 GHz with 8 GB RAM.
Thanks for the numbers! For the first run (with sharing), this is a 15% runtime slowdown and a 22% increase in memory usage. (I didn't expect an impact on memory usage; I guess this is the memory consumed by the hashtable structure itself?) Unfortunately, it is hard to tell from your benchmark what the time taken by marshalling itself is, as you first build a not-so-small data structure. Could you measure the time taken by the data structure construction for a run, so that it can be subtracted from the time used to measure the slowdown? This could be done by just replacing the last two lines with:

```ocaml
let time_before = Unix.gettimeofday ()
let u = create_binary_tree 23
let time_after = Unix.gettimeofday ()
let () =
  Printf.eprintf "%f\n%!" (time_after -. time_before)
let _ = Marshal.to_bytes u []
```
Thanks for your prompt reply! I used the following command to compile: Here are the results for
The following is the result for
The numbers now do not look so good, because it seems that most of the time in the benchmark is spent before marshalling starts. For the first benchmark for example, marshalling takes (17.44 - 16.36 = 1.08) seconds on trunk, and (19.86 - 16.27 = 3.59) seconds after the patch, so we see a 232% slowdown!
Thanks for your comments. We are currently looking into profiling and optimizing the hashmap implementation. |
What was the reason for the initial choice of this implementation in the runtime? Many implementations of hashmaps exist in C; did you consider comparing some of them before optimizing?
I think that it is going to be very hard to compete with the current implementation using a hashtable. You need to associate metadata with pointers, and the current implementation writes that data where the pointer is; any hashing scheme is going to be substantially slower. Given that the metadata needs to be stored for all blocks encountered during serialization, it seems intuitive that a large slowdown is expected when serializing payloads with many smaller blocks. We can talk about whether the current slowdown is "too large", but it is not clear to me that you can hope to get a much smaller slowdown with more optimization.

On the other hand, a rather natural idea is the following: if the metadata is only used to preserve sharing, could your implementation at least have no overhead when sharing is disabled (for example with the No_sharing flag)?

To go further: it should be possible to keep using the old scheme for objects that are in the minor heap, if the Multicore-GC design still guarantees that those are not mutated concurrently. I'm not sure whether this would be worth it, given that minor heaps are expected to be small (in your example clearly most data would live in the major heap), and that this would increase implementation complexity. But for applications that use message-passing with short values, possibly created right before marshalling starts, this could be another way to remove the hashtable overhead. For large payloads of small blocks on the major heap, I don't see any better approach than using an external map as in the current PR, or blocking parallel mutators.
This is true in general. If that's the only price to pay for multicore support, let it be. That said, some tuning of the hash table might help. For example, I find it suspicious that the hash table is resized only when a chain of collisions of length 100 is encountered. A well-maintained hash table should hit on the first try 90%+ of the time. The much maligned page table (in runtime/memory.c) achieves this, if I remember correctly.
This is a good suggestion, even though
That could be, but I fear it would make the marshaling implementation awfully complex.
I agree. The original goal of this hashtable was to minimise the number of minutes spent writing it. Apart from @shakthimaan's work here I think it's never been profiled or optimised. I imagine it could be made vastly faster. It's not obvious to me that a hashing approach need be any slower than the current implementation. The current implementation does two passes over the object graph, since it needs to restore the headers that it has mutated after it's finished marshalling. A hashtable can be discarded in bulk.
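To make the resizing discussion above concrete, here is a minimal sketch of an open-addressing map over pointer-like keys that resizes when the load factor exceeds 1/2, rather than waiting for a long collision chain. This is illustrative code, not the PR's addrmap implementation; all names and the 1/2 threshold are made up for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Linear-probing table keyed by block addresses; key 0 marks an empty
   slot.  The table size is always a power of two so that masking can
   replace the modulo. */
struct map {
  uintptr_t *keys;
  size_t size;    /* number of slots, power of two */
  size_t count;   /* number of occupied slots */
};

static void map_init(struct map *m, size_t size) {
  m->keys = calloc(size, sizeof(uintptr_t));
  m->size = size;
  m->count = 0;
}

static void map_insert(struct map *m, uintptr_t key);

/* Double the table and re-insert every key (old hashes are invalid). */
static void map_resize(struct map *m) {
  struct map old = *m;
  map_init(m, old.size * 2);
  for (size_t i = 0; i < old.size; i++)
    if (old.keys[i]) map_insert(m, old.keys[i]);
  free(old.keys);
}

static void map_insert(struct map *m, uintptr_t key) {
  /* Resize on load factor > 1/2, keeping probe chains short. */
  if (2 * (m->count + 1) > m->size) map_resize(m);
  size_t h = (key >> 3) & (m->size - 1);   /* drop alignment bits */
  while (m->keys[h] != 0 && m->keys[h] != key)
    h = (h + 1) & (m->size - 1);
  if (m->keys[h] == 0) { m->keys[h] = key; m->count++; }
}
```

With this policy the table stays at most half full, so the expected number of probes per lookup stays small regardless of how many blocks are serialized.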
Using an allocated size of 60 MB, the hash table implementation is now only ~1.3 seconds slower than OCaml trunk for 16 million objects.
If I understand correctly, the change you make is to stop using hashing (it was using Murmurhash before, if I understand correctly) and just take the block address modulo the table size. If you tune the startup size so that your benchmark never needs to do any resizing, this provides good performance. What about the with-resizing scenario?
For performance measurements, I think that it would be interesting to see the overhead factor (that is, (with-patch time) / (without-patch time)) for the serialization part of the program, both with your fairly high alloc size and also a more typical default setting (for example 1Ki or maybe 1Mi).
I don't know much about hashing, so I am not sure whether getting rid of the hashing completely is reasonable. I understand that here the assumption is that serialized addresses will end up fairly regularly spread in the range of the modulo. A bad case would be one where many addresses of the blocks that we are serializing are the same modulo the table size. This looks like it could happen if you are serializing fixed-size structures that are consecutive in memory. (We resize if there is a chain of conflicts that is too long, but this may not help that much, as the resizing just doubles the size, which does not change the modulo results that much. Maybe actually doing 2*size + prime_number would help?)
For example, I wonder what happens if you try to serialize an array of large-ish fixed-length strings. This could be a worst-case scenario if the table size starts as a power-of-2, and the fixed-length strings are also power-of-2. For example, serializing an array of 100 strings of size 1MiB, with an initial addrmap size of 256 or 1024.
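The worst case described above is easy to model numerically: if block addresses are spaced by a power-of-two stride and the table size is also a power of two, "address mod table size" can send every block to the same bucket. The helper below is a sketch with made-up base address and stride, not code from the PR.

```c
#include <stdint.h>
#include <stddef.h>

/* Count how many distinct buckets are hit when nblocks addresses,
   starting at `base` and spaced `stride` bytes apart, are hashed by
   "word-address mod table_size".  Assumes table_size <= 1024. */
static int distinct_buckets(uintptr_t base, uintptr_t stride,
                            size_t table_size, int nblocks) {
  int used[1024] = {0};
  int distinct = 0;
  for (int i = 0; i < nblocks; i++) {
    size_t h = ((base + (uintptr_t)i * stride) / sizeof(void *)) % table_size;
    if (!used[h]) { used[h] = 1; distinct++; }
  }
  return distinct;
}
```

For 100 blocks spaced exactly 1 MiB apart and a 1024-entry table, every address lands in one bucket, turning the table into a linked list; breaking either power-of-two alignment restores the spread.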
runtime/caml/startup_aux.h
Outdated
```c
extern uintnat caml_init_custom_minor_ratio;
extern uintnat caml_init_custom_minor_max_bsz;
extern uintnat caml_trace_level;
extern uintnat caml_init_alloc_size;
```
If this variable is only meant for addrmap serialization tables, maybe the name could be more specific? What about caml_init_intern_addrmap_size?
"Address modulo size of table" is a terrible hash function, and 60 Mb is way too much space overhead. You need to combine a decent hash function with a decent dynamic resizing strategy. That's all explained in textbooks such as TAOCP volume 3. For an example of implementation, look no further than the OCaml runtime system, file runtime/memory.c, functions caml_page_table_{lookup,resize,modify}. |
runtime/addrmap.c
Outdated
```c
      break;
    }
  }
}
```
Could the whole `if (...) { ... } else while (1) { ... }` be rewritten as a `do { ... } while (1)` block, without the code duplication within the conditional?
```c
}
if (t->entries[h].key == key) {
  return &t->entries[h].value;
}
```
If I understand correctly, the lookup function will create a slot for the key if it didn't previously exist (this is what the t->entries[h].key == ADDRMAP_INVALID_KEY conditional does), but this function never sets the .value field of the entry structure. Where are values set? From what I can read they are set at initialization time and copied at resize time, but nowhere else.
The value is updated in the extern_record_location() function in runtime/extern.c.
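Putting the two review points above together, here is a hypothetical sketch of a probe loop with no duplicated slot-matching logic, which also shows the get-or-create behaviour described in the exchange: lookup claims a free slot for an unseen key and returns a pointer to its value field, which the caller fills in. Names and the sentinel are invented for the example; this is not the PR's code.

```c
#include <stdint.h>
#include <stddef.h>

struct entry { uintptr_t key; uintptr_t value; };

/* Single-loop linear probing over a power-of-two table.  Key 0 plays
   the role of ADDRMAP_INVALID_KEY here.  Returns a pointer to the
   value slot for `key`, creating the binding if it did not exist;
   the caller is responsible for setting the value. */
static uintptr_t *lookup(struct entry *entries, size_t size, uintptr_t key) {
  size_t h = (key >> 3) & (size - 1);
  while (1) {
    if (entries[h].key == key)
      return &entries[h].value;      /* existing binding */
    if (entries[h].key == 0) {
      entries[h].key = key;          /* claim the free slot */
      return &entries[h].value;
    }
    h = (h + 1) & (size - 1);        /* probe the next slot */
  }
}
```

A single loop handles both the first probe and the collision chain, so the "existing key" and "free slot" tests appear exactly once each.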
This work has been superseded by #9353, which ended up being merged -- so the feature is in, and I am closing here. |
Abstract
The OCaml runtime uses a linked data structure implementation (trail_block in trunk/runtime/extern.c) for marshaling.
Source: https://github.com/ocaml/ocaml/blob/trunk/runtime/extern.c
In OCaml, the visited set is maintained by mutating the headers and first fields of the objects being marshaled in place, and then rolling back the changes after marshaling. This is incompatible with concurrent mutators and GC threads. Hence, Multicore OCaml uses an external data structure, i.e., a hashmap (addrmap in byterun/caml/addrmap.h), and does not modify the objects being marshaled.
Source: https://github.com/ocaml-multicore/ocaml-multicore/blob/master/byterun/caml/addrmap.h
The goal is to port the use of the hashmap from Multicore OCaml to the OCaml runtime. The marshaling tests for both OCaml and Multicore OCaml are in the testsuite/tests/lib-marshal directory.
Changes
PR: Use hashmap for marshaling
(Stephen Dolan, KC Sivaramakrishnan)
Copied runtime/addrmap.c and runtime/caml/addrmap.h.
Added addrmap.h and addrmap.c to runtime/.depend by including them in runtime/Makefile.
Defined the Is_power_of_2() function in runtime/caml/misc.h.
Updated runtime/extern.c to include addrmap.h and to use addrmap.c
functions.
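As an illustration of the kind of check Is_power_of_2() performs, here is the standard bit trick (a sketch; the exact definition in runtime/caml/misc.h may differ): a power of two has exactly one bit set, and `x & (x - 1)` clears the lowest set bit, so the result is zero only for powers of two.

```c
/* Illustrative power-of-two test; the x != 0 guard excludes zero,
   which would otherwise pass the bitwise check. */
#define Is_power_of_2(x) ((x) != 0 && ((x) & ((x) - 1)) == 0)
```

Such a check is what lets the addrmap use masking (`h & (size - 1)`) instead of a modulo when probing the table.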
Tests
After porting the changes from Multicore OCaml to OCaml trunk, there
are no errors or failing tests:
The compilation times for make world and make world.opt for both the trunk and multicore/use-addrmap-hash-table-for-marshaling branches are provided below for reference:
Test system: Parabola GNU/Linux-libre on Intel(R) Core(TM) i5-6200 CPU @ 2.30GHz
The http://bench2.ocamllabs.io:8083 tests do not use a lot of marshaling, but here is the value of maxrss_kB from the test suite run for both the trunk and multicore/use-addrmap-hash-table-for-marshaling branches:
Source: Comparison of multicore/use-addrmap-hash-table-for-marshaling with vanilla OCaml trunk