
Use addrmap hash table for marshaling#9293

Closed
shakthimaan wants to merge 13 commits into ocaml:trunk from
shakthimaan:multicore-release/use-addrmap-hash-table-for-marshaling

Conversation

@shakthimaan (Contributor) commented Feb 10, 2020

Abstract

The OCaml runtime uses a linked data structure (struct trail_block in trunk/runtime/extern.c) for marshaling.

/* Trail mechanism to undo forwarding pointers put inside objects */

struct trail_entry {
  value obj;    /* address of object + initial color in low 2 bits */
  value field0; /* initial contents of field 0 */
};

struct trail_block {
  struct trail_block * previous;
  struct trail_entry entries[ENTRIES_PER_TRAIL_BLOCK];
};

Source: https://github.com/ocaml/ocaml/blob/trunk/runtime/extern.c

In OCaml, the visited set is maintained by mutating, in place, the headers and first fields of the objects being marshaled, and then rolling back the changes after marshaling completes. This is incompatible with concurrent mutators and GC threads. Hence, Multicore OCaml uses an external data structure, a hash map (addrmap in byterun/caml/addrmap.h), and does not modify the objects being marshaled.

struct addrmap_entry { value key, value; };

struct addrmap {
  struct addrmap_entry* entries;
  uintnat size;
};

Source: https://github.com/ocaml-multicore/ocaml-multicore/blob/master/byterun/caml/addrmap.h

The goal is to port the use of this hash map from Multicore OCaml to the OCaml runtime. The marshaling tests for both OCaml and Multicore OCaml are present in the testsuite/tests/lib-marshal directory.

Changes

  • PR: Use hashmap for marshaling
    (Stephen Dolan, KC Sivaramakrishnan)

    Copied runtime/addrmap.c and runtime/caml/addrmap.h from Multicore OCaml.

    Added addrmap.h and addrmap.c to runtime/.depend by including them in
    runtime/Makefile.

    Defined an Is_power_of_2() function in runtime/caml/misc.h.

    Updated runtime/extern.c to include addrmap.h and to use the addrmap.c
    functions.

Tests

After porting the changes from Multicore OCaml to OCaml trunk, there
are no errors or failing tests:

Summary:
  2642 tests passed
  40 tests skipped
   0 tests failed
  100 tests not started (parent test skipped or failed)
   0 unexpected errors
2782 tests considered

The compilation time for make world and make world.opt for both
trunk and multicore/use-addrmap-hash-table-for-marshaling branches are
provided below for reference:

Target                 trunk        hash-table-for-marshaling
make world      real   2m54.526s    2m57.806s
                user   2m41.816s    2m45.053s
                sys    0m11.904s    0m12.345s
make world.opt  real   5m26.222s    5m31.342s
                user   5m0.913s     5m5.496s
                sys    0m24.386s    0m24.972s

Test system: Parabola GNU/Linux-libre on Intel(R) Core(TM) i5-6200 CPU @ 2.30GHz

The benchmarks at http://bench2.ocamllabs.io:8083 do not exercise marshaling heavily, but here are the maxrss_kB values from the test-suite run for both the trunk and multicore/use-addrmap-hash-table-for-marshaling branches:

Metric   trunk       hash-table-for-marshaling
count    192         192
mean     24702.625   24701.5
std      72617.01    72612.89
min      2476        2476
25%      4524        4524
50%      5212        5212
75%      9531        9531
max      730868      730868

Source: Comparison of multicore/use-addrmap-hash-table-for-marshaling with vanilla OCaml trunk

@dra27 (Member) commented Feb 11, 2020

(I just pushed an adjustment which should allow AppVeyor to complete)

@dra27 (Member) left a comment:

A few quick housekeeping comments on the headers, rather than a full review, I'm afraid.

@gasche (Member) commented Feb 11, 2020

I find it hard to draw any conclusion from the performance numbers given here. If I understood correctly, the benchmark suite does not test marshalling much (so the fact that the numbers are identical is not very informative). The compiler does rely on marshalling a bit (to create and read .cm{i,o,x} files), but the performance results are hard to interpret due to potential measurement noise. (Are those just one run of each, or a result aggregated from several runs?).

It would be useful to have a microbenchmark to measure marshalling speed regressions, so that people can get a quick estimate of potential performance regression for their workload if they know the fraction of time spent in marshalling. It may also be interesting to consider adding marshalling examples to the benchmark suite, for example a simple message-passing program (that just passes messages to itself for benchmarking purposes).

@dra27 (Member) commented Feb 11, 2020

opam's state-loading mechanism can be turned fairly quickly into a micro-benchmark (see https://github.com/ocaml/opam/blob/master/src/state/opamRepositoryState.mli#L20). It's at least an important benchmark as it's used on virtually every opam anyone runs!

@samwgoldman (Contributor) commented:

Hack (https://hacklang.org/) and Flow (http://flow.org/) both use the Marshal module extensively to write values into shared memory and read them out. The exact amount of Marshaling that we do depends on the source code we're analyzing, but our internal (unfortunately unsharable) workload is quite significant.

If it would be helpful, I can look to see what the impact on our workloads is and report back.

@kayceesrk (Contributor) commented Feb 12, 2020

@samwgoldman Studying the impact of this PR on internal workloads would be quite useful. Does that mean Hack and Flow can build on trunk? If there's a fork that does build, it would be nice to have them in our benchmark suite.

@samwgoldman (Contributor) commented:

@kayceesrk Alas, we won't build in trunk at the moment due to, at least, our dependency on ppx_deriving. This is just the first thing I ran into when I tried.

I agree it would be nice to have these in your benchmark suite. In the meantime, it seems like it shouldn't be too hard for me to backport this, although perhaps the results would not be as meaningful.

@kayceesrk (Contributor) commented:

> although perhaps the results would not be as meaningful.

Not much has changed in extern.c recently, so the results should be meaningful. Regarding ppx: what we end up doing in sandmark is to commit the output of the ppx pass, so that we can compile the output program with different compiler versions. Perhaps something like this might work.

@shakthimaan (Contributor, author) commented Feb 16, 2020

I used the red-black tree example from John Whitington's book, "More OCaml: Algorithms, Methods & Diversions" to create a red-black tree (https://gist.github.com/shakthimaan/2c33be7724c33735c80d9356d1db6837) of height 23 with only integer data, and tested the same with Marshal.to_bytes. The executable was created using the following command from the top-level sources directory:

$ ./runtime/ocamlrun ./ocamlopt -nostdlib -I stdlib file.ml

The time utility was used to measure the resource usage:

$ /usr/bin/time -v ./a.out

With an empty extern_flags list (Marshal.to_bytes data []), these are the results that differed between trunk (82d7971...) and the hash-table-for-marshaling branch:

Metric                                   trunk     hash-table-for-marshaling
User time (seconds)                      16.97     19.16
System time (seconds)                    0.54      0.79
Maximum resident set size (kbytes)       1449320   1769960
Minor (reclaiming a frame) page faults   363989    559518
Involuntary context switches             47        85

Using Marshal.to_bytes data [Marshal.No_sharing], the results that differed are provided below for reference:

Metric                                   trunk     hash-table-for-marshaling
User time (seconds)                      16.8      19.25
System time (seconds)                    0.49      0.76
Maximum resident set size (kbytes)       1186964   1774652
Minor (reclaiming a frame) page faults   298391    559520
Involuntary context switches             82        49

Test system: Parabola GNU/Linux-libre running an Intel(R) Core(TM) i5-6200 CPU @ 2.30GHz with 8 GB RAM.

@gasche (Member) commented Feb 16, 2020

Thanks for the numbers! For the first run (with sharing), this is a 15% runtime slowdown and a 22% increase in memory usage.

(I didn't expect an impact on memory usage, I guess this is the memory consumed by the hashtable structure itself?)

Unfortunately, it is hard to tell from your benchmark what the time taken by marshaling itself is, as you first build a not-so-small data structure. Could you measure the time taken by the data-structure construction in a run, so that it can be subtracted from the total time used to measure the slowdown? This could be done by just replacing the last two lines with:

let time_before = Unix.gettimeofday ()
let u = create_binary_tree 23
let time_after = Unix.gettimeofday ()
let () =
  Printf.eprintf "%f\n%!" (time_after -. time_before)
let _ = Marshal.to_bytes u []

@shakthimaan (Contributor, author) commented Feb 16, 2020

Thanks for your prompt reply! I used the following command to compile:

$ ./runtime/ocamlrun ./ocamlopt -nostdlib -I stdlib -I otherlibs/unix unix.cmxa file.ml

Here are the results for Marshal.to_bytes data []:

Metric                               trunk       hash-table-for-marshaling
Construction time (Printf.eprintf)   16.360950   16.274193
User time (seconds)                  16.9        19.04
System time (seconds)                0.51        0.78
Elapsed (wall clock) time            0:17.44     0:19.86

The following is the result for Marshal.to_bytes data [Marshal.No_sharing]:

Metric                               trunk       hash-table-for-marshaling
Construction time (Printf.eprintf)   16.486907   16.454468
User time (seconds)                  16.82       19.32
System time (seconds)                0.41        0.78
Elapsed (wall clock) time            0:17.27     0:20.14

@gasche (Member) commented Feb 16, 2020

The numbers now do not look so good, because it seems that most of the time in the benchmark is spent before marshalling starts. For the first benchmark for example, marshalling takes (17.44 - 16.36 = 1.08) seconds on trunk, and (19.86 - 16.27 = 3.59) seconds after the patch, so we see a 232% slowdown!

@shakthimaan (Contributor, author) commented:
Thanks for your comments. We are currently looking into profiling and optimizing the hashmap implementation.

@bobot (Contributor) commented Feb 17, 2020

> We are currently looking into profiling and optimizing the hashmap implementation.

What was the reason for the choice of this implementation in the runtime initially? Many implementations of hashmaps exist in C; did you consider comparing some of them before optimizing?

@gasche (Member) commented Feb 17, 2020

I think that it is going to be very hard to compete with the current implementation using a hashtable. You need to associate metadata with pointers; the current implementation writes that data where the pointer is, so any hashing scheme is going to be substantially slower. Given that the metadata needs to be stored for all blocks encountered during serialization, it seems intuitive that a large slowdown is expected when serializing payloads with many smaller blocks. We can talk about whether the current slowdown is "too large", but it is not clear to me that you can hope to get a much smaller slowdown with more optimization.

On the other hand, a rather natural idea is the following: if the metadata is only used to preserve sharing, could your implementation at least have no overhead when No_sharing is used? I suspect that it is a performance bug in the current PR that you store pointers in the table even when No_sharing is set; fixing it (if that is indeed correct) should remove the overhead in No_sharing mode.

To go further: it should be possible to keep using the old scheme for objects that are in the minor heap, if the Multicore-GC design still guarantees that those are not mutated concurrently. I'm not sure whether this would be worth it, given that minor heaps are expected to be small (in your example clearly most data would live in the major heap), and that this would increase implementation complexity. But for applications that use message-passing with short values, possibly created right before marshalling start, this could be another way to remove the hashtable overhead.

For large payloads of small blocks on the major heap, I don't see any better approach than using an external map as with the current PR, or blocking parallel mutators.

@xavierleroy (Contributor) commented:

> I think that it is going to be very hard to compete with the current implementation using a hashtable. You need to associate metadata to pointers, the current implementation is using writing the data where the pointer is; any hashing scheme is going to be substantially slower.

This is true in general. If that's the only price to pay for multicore support, let it be. This said, some tuning of the hash table might help. For example, I find it suspicious that the hash table is resized only when a chain of collisions of length 100 is encountered. A well-maintained hash table should hit on the first try 90%+ of the time. The much-maligned page table (in runtime/memory.c) achieves this, if I remember correctly.

> if the metadata is only used to preserve sharing, could your implementation at least have no overhead when No_sharing is used?

This is a good suggestion, even though No_sharing is often unusable because it produces excessively big marshaled data.

> it should be possible to keep using the old scheme for objects that are in the minor heap, if the Multicore-GC design still guarantees that those are not mutated concurrently.

That could be, but I fear it would make the marshaling implementation awfully complex.

@stedolan (Contributor) commented:

> This said, some tuning of the hash table might help. For example, I find it suspicious that the hash table is resized only when a chain of collisions of length 100 was encountered.

I agree. The original goal of this hashtable was to minimise the number of minutes spent writing it. Apart from @shakthimaan's work here I think it's never been profiled or optimised. I imagine it could be made vastly faster.

It's not obvious to me that a hashing approach need be any slower than the current implementation. The current implementation does two passes over the object graph, since it needs to restore the headers that it's mutated after it's finished marshalling. A hashtable can be discarded in bulk.

@shakthimaan (Contributor, author) commented:

Using an allocated size of 60 MB, the hash-table implementation is now only ~1.3 seconds slower than OCaml trunk for 16 million objects. A caml_init_alloc_size runtime parameter has also been added.

@gasche (Member) left a comment:

If I understand correctly, the change you make is to stop using hashing (it was using MurmurHash before) and just take the block address modulo the table size. If you tune the startup size so that your benchmark never needs to do any resizing, this provides good performance. What about the with-resizing scenario?

For performance measurements, I think that it would be interesting to see the overhead factor (that is, (with-patch time) / (without-patch time)) for the serialization part of the program, both with your fairly high alloc size and also a more typical default setting (for example 1Ki or maybe 1Mi).

I don't know much about hashing, so I am not sure whether getting rid of the hashing completely is reasonable. I understand that the assumption here is that serialized addresses will end up fairly regularly spread in the range of the modulo. A bad case would be one where many addresses of the blocks that we are serializing are the same modulo the table size. This looks like it could happen if you are serializing fixed-size structures that are consecutive in memory. (We resize if a chain of conflicts gets too long, but this may not help much, as the resizing just doubles the size, which does not change the modulo results that much. Maybe actually doing 2*size + prime_number would help?)

For example, I wonder what happens if you try to serialize an array of large-ish fixed-length strings. This could be a worst-case scenario if the table size starts as a power-of-2, and the fixed-length strings are also power-of-2. For example, serializing an array of 100 strings of size 1MiB, with an initial addrmap size of 256 or 1024.

extern uintnat caml_init_custom_minor_ratio;
extern uintnat caml_init_custom_minor_max_bsz;
extern uintnat caml_trace_level;
extern uintnat caml_init_alloc_size;
@gasche (Member) commented on these lines:

If this variable is only meant for addrmap serialization tables, maybe the name could be more specific? What about caml_init_intern_addrmap_size?

@xavierleroy (Contributor) commented:

"Address modulo size of table" is a terrible hash function, and 60 MB is way too much space overhead. You need to combine a decent hash function with a decent dynamic resizing strategy. That's all explained in textbooks such as TAOCP volume 3. For an example implementation, look no further than the OCaml runtime system, file runtime/memory.c, functions caml_page_table_{lookup,resize,modify}.

@shakthimaan (Contributor, author) commented:

  • The addrmap code has now been updated to use multiplicative Fibonacci hashing, similar to the caml_page_table_{lookup,resize} implementation in runtime/memory.c.
  • Marshaling a binary tree of height 23 (16 million objects) is now only ~2s slower than OCaml trunk.
  • The hash table is doubled only when the load factor reaches 90%.

break;
}
}
}
@gasche (Member) commented on these lines:

Could the whole if (...) { ... } else while (1) { ... } be rewritten as a do { ... } while(1) block, without the code duplication within the conditional?

@shakthimaan (Contributor, author) replied Mar 6, 2020:

Updated in commit 24dc63b.

}
if (t->entries[h].key == key) {
return &t->entries[h].value;
}
@gasche (Member) commented on these lines:

If I understand correctly, the lookup function will create a slot for the key if it didn't previously exist (this is what the t->entries[h].key == ADDRMAP_INVALID_KEY conditional does), but this function never sets the .value field of the entry structure. Where are values set? From what I can read they are set at initialization time and copied at resize time, but nowhere else.

@shakthimaan (Contributor, author) replied:

The value is updated in the extern_record_location() function in runtime/extern.c.

@xavierleroy (Contributor) commented:

I got tired of the back-and-forth on this PR, so I wrote my own implementation: #9353.

I move to close this PR and concentrate the reviewing on #9353 instead.

@gasche (Member) commented Apr 17, 2020

This work has been superseded by #9353, which ended up being merged -- so the feature is in, and I am closing here.
Thanks for the initial work and performance investigation!

@gasche closed this on Apr 17, 2020.


9 participants