
Use addrmap hash table for marshaling#9293

Closed
shakthimaan wants to merge 13 commits into ocaml:trunk from
shakthimaan:multicore-release/use-addrmap-hash-table-for-marshaling

Conversation

@shakthimaan (Contributor) commented Feb 10, 2020

Abstract

The OCaml runtime uses a linked data structure (struct trail_block in trunk/runtime/extern.c) for marshaling.

/* Trail mechanism to undo forwarding pointers put inside objects */

struct trail_entry {
  value obj;    /* address of object + initial color in low 2 bits */
  value field0; /* initial contents of field 0 */
};

struct trail_block {
  struct trail_block * previous;
  struct trail_entry entries[ENTRIES_PER_TRAIL_BLOCK];
};

Source: https://github.com/ocaml/ocaml/blob/trunk/runtime/extern.c

In OCaml, the visited set is maintained by mutating, in place, the headers and first fields of the objects being marshaled, and then rolling back the changes after marshaling completes. This is incompatible with concurrent mutators and GC threads. Hence, Multicore OCaml uses an external data structure, a hash map (addrmap in byterun/caml/addrmap.h), and does not modify the objects being marshaled.

struct addrmap_entry { value key, value; };

struct addrmap {
  struct addrmap_entry* entries;
  uintnat size;
};

Source: https://github.com/ocaml-multicore/ocaml-multicore/blob/master/byterun/caml/addrmap.h

The goal is to port the use of this hash map from Multicore OCaml to the OCaml runtime. The marshaling tests for both OCaml and Multicore OCaml are present in the testsuite/tests/lib-marshal directory.

Changes

  • PR: Use hashmap for marshaling
    (Stephen Dolan, KC Sivaramakrishnan)

    Copied runtime/addrmap.c and runtime/caml/addrmap.h from Multicore OCaml.

    Added addrmap.h and addrmap.c to runtime/.depend by including them in
    runtime/Makefile.

    Defined an Is_power_of_2() function in runtime/caml/misc.h.

    Updated runtime/extern.c to include addrmap.h and to use the addrmap.c
    functions.

Tests

After porting the changes from Multicore OCaml to OCaml trunk, there
are no errors or failing tests:

Summary:
  2642 tests passed
  40 tests skipped
   0 tests failed
  100 tests not started (parent test skipped or failed)
   0 unexpected errors
2782 tests considered

The compilation time for make world and make world.opt for both
trunk and multicore/use-addrmap-hash-table-for-marshaling branches are
provided below for reference:

Target                 trunk        hash-table-for-marshaling
make world      real   2m54.526s    2m57.806s
                user   2m41.816s    2m45.053s
                sys    0m11.904s    0m12.345s
make world.opt  real   5m26.222s    5m31.342s
                user   5m0.913s     5m5.496s
                sys    0m24.386s    0m24.972s

Test system: Parabola GNU/Linux-libre on Intel(R) Core(TM) i5-6200 CPU @ 2.30GHz

The benchmarks at http://bench2.ocamllabs.io:8083 do not exercise marshaling heavily, but here are the maxrss_kB values from the test-suite run for both the trunk and multicore/use-addrmap-hash-table-for-marshaling branches:

Metric   trunk       hash-table-for-marshaling
count    192         192
mean     24702.625   24701.5
std      72617.01    72612.89
min      2476        2476
25%      4524        4524
50%      5212        5212
75%      9531        9531
max      730868      730868

Source: Comparison of multicore/use-addrmap-hash-table-for-marshaling with vanilla OCaml trunk

@dra27 (Member) commented Feb 11, 2020

(I just pushed an adjustment which should allow AppVeyor to complete)

@dra27 (Member) left a comment:

A few quick housekeeping comments on the headers, rather than a full review, I'm afraid.

@gasche (Member) commented Feb 11, 2020

I find it hard to draw any conclusion from the performance numbers given here. If I understood correctly, the benchmark suite does not test marshalling much (so the fact that the numbers are identical is not very informative). The compiler does rely on marshalling a bit (to create and read .cm{i,o,x} files), but the performance results are hard to interpret due to potential measurement noise. (Are those just one run of each, or a result aggregated from several runs?).

It would be useful to have a microbenchmark to measure marshalling speed regressions, so that people can get a quick estimate of potential performance regression for their workload if they know the fraction of time spent in marshalling. It may also be interesting to consider adding marshalling examples to the benchmark suite, for example a simple message-passing program (that just passes messages to itself for benchmarking purposes).

@dra27 (Member) commented Feb 11, 2020

opam's state-loading mechanism can be turned fairly quickly into a micro-benchmark (see https://github.com/ocaml/opam/blob/master/src/state/opamRepositoryState.mli#L20). It's at least an important benchmark as it's used on virtually every opam anyone runs!

@samwgoldman (Contributor) commented:

Hack (https://hacklang.org/) and Flow (http://flow.org/) both use the Marshal module extensively to write values into shared memory and read them out. The exact amount of Marshaling that we do depends on the source code we're analyzing, but our internal (unfortunately unsharable) workload is quite significant.

If it would be helpful, I can look to see what the impact on our workloads is and report back.

@kayceesrk (Contributor) commented Feb 12, 2020

@samwgoldman Studying the impact of this PR on internal workloads would be quite useful. Does that mean Hack and Flow can build on trunk? If there's a fork that does build, it would be nice to have them in our benchmark suite.

@samwgoldman (Contributor) commented:

@kayceesrk Alas, we won't build in trunk at the moment due to, at least, our dependency on ppx_deriving. This is just the first thing I ran into when I tried.

I agree it would be nice to have these in your benchmark suite. In the meantime, it seems like it shouldn't be too hard for me to backport this, although perhaps the results would not be as meaningful.

@kayceesrk (Contributor) commented:

> although perhaps the results would not be as meaningful.

Not much has changed in extern.c recently, so the results should be meaningful. Regarding ppx: what we end up doing in sandmark is to commit the output of the ppx pass, so that we can compile the output program with different compiler versions. Perhaps something like this might work.

@shakthimaan (Contributor, author) commented Feb 16, 2020

I used the red-black tree example from John Whitington's book, "More OCaml: Algorithms, Methods & Diversions" to create a red-black tree (https://gist.github.com/shakthimaan/2c33be7724c33735c80d9356d1db6837) of height 23 with only integer data, and tested the same with Marshal.to_bytes. The executable was created using the following command from the top-level sources directory:

$ ./runtime/ocamlrun ./ocamlopt -nostdlib -I stdlib file.ml

The time utility was used to measure the resource usage:

$ /usr/bin/time -v ./a.out

With an empty extern_flags list (Marshal.to_bytes data []), these are the results that differed between trunk (82d7971...) and the hash-table-for-marshaling branch:

Metric                                   trunk     hash-table-for-marshaling
User time (seconds)                      16.97     19.16
System time (seconds)                    0.54      0.79
Maximum resident set size (kbytes)       1449320   1769960
Minor (reclaiming a frame) page faults   363989    559518
Involuntary context switches             47        85

Using Marshal.to_bytes data [Marshal.No_sharing], the results that differed are provided below for reference:

Metric                                   trunk     hash-table-for-marshaling
User time (seconds)                      16.8      19.25
System time (seconds)                    0.49      0.76
Maximum resident set size (kbytes)       1186964   1774652
Minor (reclaiming a frame) page faults   298391    559520
Involuntary context switches             82        49

Test system: Parabola GNU/Linux-libre running an Intel(R) Core(TM) i5-6200 CPU @ 2.30GHz with 8 GB RAM.

@gasche (Member) commented Feb 16, 2020

Thanks for the numbers! For the first run (with sharing), this is a 15% runtime slowdown and a 22% increase in memory usage.

(I didn't expect an impact on memory usage, I guess this is the memory consumed by the hashtable structure itself?)

Unfortunately, it is hard to tell from your benchmark what the time taken by marshaling itself is, as you first build a not-so-small data structure. Could you measure the time taken by the data-structure construction in a run, so that it can be subtracted from the total time used to measure the slowdown? This could be done by just replacing the last two lines with:

let time_before = Unix.gettimeofday ()
let u = create_binary_tree 23
let time_after = Unix.gettimeofday ()
let () =
  Printf.eprintf "%f\n%!" (time_after -. time_before)
let _ = Marshal.to_bytes u []

@shakthimaan (Contributor, author) commented Feb 16, 2020

Thanks for your prompt reply! I used the following command to compile:

$ ./runtime/ocamlrun ./ocamlopt -nostdlib -I stdlib -I otherlibs/unix unix.cmxa file.ml

Here are the results for Marshal.to_bytes data []:

Metric                               trunk       hash-table-for-marshaling
Construction time (Printf.eprintf)   16.360950   16.274193
User time (seconds)                  16.9        19.04
System time (seconds)                0.51        0.78
Elapsed (wall clock) time            0:17.44     0:19.86

The following is the result for Marshal.to_bytes data [Marshal.No_sharing]:

Metric                               trunk       hash-table-for-marshaling
Construction time (Printf.eprintf)   16.486907   16.454468
User time (seconds)                  16.82       19.32
System time (seconds)                0.41        0.78
Elapsed (wall clock) time            0:17.27     0:20.14

@gasche (Member) commented Feb 16, 2020

The numbers now do not look so good, because it seems that most of the time in the benchmark is spent before marshalling starts. For the first benchmark for example, marshalling takes (17.44 - 16.36 = 1.08) seconds on trunk, and (19.86 - 16.27 = 3.59) seconds after the patch, so we see a 232% slowdown!

@shakthimaan (Contributor, author) commented:
Thanks for your comments. We are currently looking into profiling and optimizing the hashmap implementation.

@bobot (Contributor) commented Feb 17, 2020

> We are currently looking into profiling and optimizing the hashmap implementation.

What was the reason for the choice of this implementation in the runtime initially? Many implementations of hashmaps exist in C; did you consider comparing some of them before optimizing?

@gasche (Member) commented Feb 17, 2020

I think that it is going to be very hard to compete with the current implementation using a hashtable. You need to associate metadata with pointers; the current implementation writes that data where the pointer is, so any hashing scheme is going to be substantially slower. Given that the metadata needs to be stored for all blocks encountered during serialization, it seems intuitive that a large slowdown is expected when serializing payloads with many smaller blocks. We can talk about whether the current slowdown is "too large", but it is not clear to me that you can hope to get a much smaller slowdown with more optimization.

On the other hand, a rather natural idea is the following: if the metadata is only used to preserve sharing, could your implementation at least have no overhead when No_sharing is used? I suspect that it is a performance bug in the current PR that you store pointers in the table even when No_sharing is set; fixing it (if that is indeed correct) should remove the overhead in No_sharing mode.

To go further: it should be possible to keep using the old scheme for objects that are in the minor heap, if the Multicore-GC design still guarantees that those are not mutated concurrently. I'm not sure whether this would be worth it, given that minor heaps are expected to be small (in your example clearly most data would live in the major heap), and that this would increase implementation complexity. But for applications that use message-passing with short values, possibly created right before marshalling start, this could be another way to remove the hashtable overhead.

For large payloads of small blocks on the major heap, I don't see any better approach than using an external map as with the current PR, or blocking parallel mutators.

@xavierleroy (Contributor) commented:

> I think that it is going to be very hard to compete with the current implementation using a hashtable. You need to associate metadata to pointers, the current implementation is using writing the data where the pointer is; any hashing scheme is going to be substantially slower.

This is true in general. If that's the only price to pay for multicore support, let it be. This said, some tuning of the hash table might help. For example, I find it suspicious that the hash table is resized only when a chain of collisions of length 100 is encountered. A well-maintained hash table should hit on the first try 90%+ of the time. The much-maligned page table (in runtime/memory.c) achieves this, if I remember correctly.

> if the metadata is only used to preserve sharing, could your implementation at least have no overhead when No_sharing is used?

This is a good suggestion, even though No_sharing is often unusable because it produces excessively big marshaled data.

> it should be possible to keep using the old scheme for objects that are in the minor heap, if the Multicore-GC design still guarantees that those are not mutated concurrently.

That could be, but I fear it would make the marshaling implementation awfully complex.

@stedolan (Contributor) commented:

> This said, some tuning of the hash table might help. For example, I find it suspicious that the hash table is resized only when a chain of collisions of length 100 was encountered.

I agree. The original goal of this hashtable was to minimise the number of minutes spent writing it. Apart from @shakthimaan's work here I think it's never been profiled or optimised. I imagine it could be made vastly faster.

It's not obvious to me that a hashing approach need be any slower than the current implementation. The current implementation does two passes over the object graph, since it needs to restore the headers that it's mutated after it's finished marshalling. A hashtable can be discarded in bulk.

@shakthimaan (Contributor, author) commented:

Using an allocated size of 60 MB, the hash-table implementation is now only ~1.3 seconds slower than OCaml trunk for 16 million objects. A caml_init_alloc_size runtime parameter has also been added.

@gasche (Member) left a comment:

If I understand correctly, the change you make is to stop using hashing (it was using MurmurHash before) and just take the block address modulo the table size. If you tune the startup size so that your benchmark never needs to do any resizing, this provides good performance. What about the with-resizing scenario?

For performance measurements, I think that it would be interesting to see the overhead factor (that is, (with-patch time) / (without-patch time)) for the serialization part of the program, both with your fairly high alloc size and also a more typical default setting (for example 1Ki or maybe 1Mi).

I don't know much about hashing, so I am not sure whether getting rid of the hashing completely is reasonable. I understand that the assumption here is that serialized addresses will end up fairly regularly spread in the range of the modulo. A bad case would be one where many addresses of the blocks that we are serializing are the same modulo the table size. This looks like it could happen if you are serializing fixed-size structures that are consecutive in memory. (We resize if a chain of conflicts gets too long, but this may not help much, as the resizing just doubles the size, which does not change the modulo results that much. Maybe actually doing 2*size + prime_number would help?)

For example, I wonder what happens if you try to serialize an array of large-ish fixed-length strings. This could be a worst-case scenario if the table size starts as a power-of-2, and the fixed-length strings are also power-of-2. For example, serializing an array of 100 strings of size 1MiB, with an initial addrmap size of 256 or 1024.

extern uintnat caml_init_custom_minor_ratio;
extern uintnat caml_init_custom_minor_max_bsz;
extern uintnat caml_trace_level;
extern uintnat caml_init_alloc_size;
@gasche (Member) commented on these lines:

If this variable is only meant for addrmap serialization tables, maybe the name could be more specific? What about caml_init_intern_addrmap_size?

@xavierleroy (Contributor) commented:

"Address modulo size of table" is a terrible hash function, and 60 MB is way too much space overhead. You need to combine a decent hash function with a decent dynamic resizing strategy. That's all explained in textbooks such as TAOCP volume 3. For an example implementation, look no further than the OCaml runtime system, file runtime/memory.c, functions caml_page_table_{lookup,resize,modify}.

@shakthimaan (Contributor, author) commented:

  • The addrmap code has now been updated to use multiplicative Fibonacci hashing, similar to the caml_page_table_{lookup,resize} implementation in runtime/memory.c.
  • Marshaling a binary tree of height 23 (16 million objects) is now only ~2s slower than OCaml trunk.
  • The hash table is doubled only when the load factor reaches 90%.

break;
}
}
}
@gasche (Member) commented on these lines:

Could the whole if (...) { ... } else while (1) { ... } be rewritten as a do { ... } while(1) block, without the code duplication within the conditional?

@shakthimaan (Contributor, author) replied Mar 6, 2020:

Updated in commit 24dc63b.

}
if (t->entries[h].key == key) {
return &t->entries[h].value;
}
@gasche (Member) commented on these lines:

If I understand correctly, the lookup function will create a slot for the key if it didn't previously exist (this is what the t->entries[h].key == ADDRMAP_INVALID_KEY conditional does), but this function never sets the .value field of the entry structure. Where are values set? From what I can read they are set at initialization time and copied at resize time, but nowhere else.

@shakthimaan (Contributor, author) replied:

The value is updated in the extern_record_location() function in runtime/extern.c.

@xavierleroy (Contributor) commented:

I got tired of the back-and-forth on this PR, so I wrote my own implementation: #9353.

I move to close this PR and concentrate the reviewing on #9353 instead.

@gasche (Member) commented Apr 17, 2020

This work has been superseded by #9353, which ended up being merged -- so the feature is in, and I am closing here.
Thanks for the initial work and performance investigation!

@gasche closed this on Apr 17, 2020.


9 participants