Make Minor_heap_max and Max_domains OCAMLRUNPARAM options #10955

Closed

sabine wants to merge 22 commits into ocaml:trunk from sabine:minor_heap_max_domains_OCAMLRUNPARAM_795

Conversation

@sabine
Contributor

@sabine sabine commented Jan 27, 2022

This implements ocaml-multicore/ocaml-multicore#795.

Minor_heap_max and Max_domains become runtime parameters set through OCAMLRUNPARAM, and manual entries have been added.

Several fixed-size static arrays of length Max_domains have been turned into pointers; the arrays are allocated at program startup based on caml_params->max_domains.

A new default of 16 is proposed for Max_domains. The default for Minor_heap_max is unchanged at 256k.

Valgrind runs with OCAMLRUNPARAM="d=15" but fails with the default settings: "Fatal error: Not enough heap memory to start up."

For running AFL, we need a lower number of domains to stay below AFL's 50 MB virtual memory limit, or the -m none option. The manual has been updated with this information, and the AFL testcase now uses OCAMLRUNPARAM instead of the -m none option.

Discuss:

  • Are the proposed default values reasonable?
  • Deallocation: do we need to deallocate, and if so, where?

@sabine sabine marked this pull request as draft January 27, 2022 14:37
case 'M': scanmult (opt, &params.init_custom_major_ratio); break;
case 'm': scanmult (opt, &params.init_custom_minor_ratio); break;
case 'n': scanmult (opt, &params.init_custom_minor_max_bsz); break;
case 'N': scanmult (opt, &params.minor_heap_max_wsz); break;
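For context, scanmult parses an OCAMLRUNPARAM value as a decimal number with an optional multiplier suffix, so "N=32M" means 32 × 2^20. A toy sketch of such a parser follows; the function name `parse_mult` is hypothetical and this is not the runtime's actual implementation:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy sketch of scanmult-style parsing: a decimal number optionally
   followed by a k/M/G multiplier suffix, as in OCAMLRUNPARAM="N=32M".
   Illustrative only, not the runtime's actual code. */
size_t parse_mult(const char *opt)
{
  char *end;
  size_t v = (size_t) strtoul(opt, &end, 10);
  switch (*end) {
  case 'k': return v << 10;  /* x 1024 */
  case 'M': return v << 20;  /* x 1024^2 */
  case 'G': return v << 30;  /* x 1024^3 */
  default:  return v;        /* no suffix: plain value */
  }
}
```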
Contributor

@kayceesrk kayceesrk Jan 27, 2022
Isn't there an interaction between the N and s parameters? For soundness, s <= N must hold. I see three options here:

  1. The runtime raises an exception if s <= N doesn't hold. This makes things difficult for users.
  2. Silently bump N to be at least as large as s. Choose the default values such that N >= s.
  3. Remove the N parameter altogether and only have s, i.e., interpret N == s. The downside is that the size of the minor heap cannot be programmatically increased during runtime from its initial value, since the minor heap is already at its maximum size; it can only be decreased. FWIW, depending on the default value for N, option 2 also suffers from the same problem.

If it is not common to increase the size of the minor heap during execution, I would go for 3, as it has one fewer option and is less confusing for users.
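Option 2 amounts to a single clamp after parsing. A hypothetical sketch in C (the struct, field names, and function name are illustrative, not the actual runtime's):

```c
#include <stddef.h>

/* Illustrative sketch of option 2: after parsing OCAMLRUNPARAM, silently
   bump the minor-heap maximum N so that s <= N always holds. Names are
   made up for illustration; this is not runtime code. */
typedef struct {
  size_t minor_heap_init_wsz; /* s: initial minor heap size, in words */
  size_t minor_heap_max_wsz;  /* N: maximum minor heap size, in words */
} runtime_params;

void clamp_minor_heap_max(runtime_params *p)
{
  if (p->minor_heap_max_wsz < p->minor_heap_init_wsz)
    p->minor_heap_max_wsz = p->minor_heap_init_wsz; /* bump N up to s */
}
```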

Contributor

@damiendoligez might have an opinion on this.

Contributor Author

Searching the Internet, I can't seem to find any discussions where people suggest increasing the minor heap size during execution. I only saw people recommending to increase the minor heap size using the s option.

Are there other ways we could go about checking? Is it possible to search all opam packages for code that increases the size of the minor heap during execution?

Clearly, 3 has the least potential for confusing users.

Contributor

I have a program that has command-line parameters to tune the GC control knobs (see https://github.com/Gbury/dolmen/blob/7cd59bbeaa536056d19facdbdc724b104e891732/src/bin/options.ml#L252-L262 ,
https://github.com/Gbury/dolmen/blob/7cd59bbeaa536056d19facdbdc724b104e891732/src/bin/options.ml#L316 , and https://github.com/Gbury/dolmen/blob/7cd59bbeaa536056d19facdbdc724b104e891732/src/bin/options.ml#L369-L401 ), but it's mainly because I find it easier to use command-line arguments rather than environment variables, and because once upon a time I ran some benchmarks to tweak the default parameters a bit. On that note, there's also at least Coq, which increases the minor heap size at startup (though I think that falls under the definition of "during execution"): see https://github.com/coq/coq/blob/07537426ad106c721e67d513c6db2e9047131e7b/sysinit/coqinit.ml

Contributor

CompCert also increases minor heap size at start-up:
https://github.com/AbsInt/CompCert/blob/85b1c4091e13dec13fe03f28e81b256c60f9f7ef/driver/Driver.ml#L403-L406

You could OPAM-grep for Gc.set: I suspect many uses of this function tweak the minor heap size.

Contributor Author

I'm running the OPAM-grep, and so far, it looks like you are correct:

Using Gc.set to set the minor heap size is something that people commonly do.

Contributor

Changing heap sizes throughout runtime has been proposed/discussed too: https://sympa.inria.fr/sympa/arc/caml-list/2010-11/msg00323.html. I have sometimes had success with similar techniques, but would not suggest them as a general-purpose default.

@kayceesrk
Contributor

kayceesrk commented Jan 27, 2022

The domain_parallel_spawn_burn test creates lots of domains that are alive at the same time, which is causing the CI failure.

You may want to increase the Max_domains value for this test. A Max_domains value of 32 should be sufficient, as the slots will be reused. See

let test_parallel_spawn () =
  for i = 1 to 20 do
    let a = Array.init 25 (fun _ -> Domain.spawn (fun () -> burn [0])) in
    for j = 0 to 24 do
      join a.(j)
    done
  done

You can specify the OCAMLRUNPARAM for a single test in the TEST stanza like so:

ocamlrunparam += ",b=1"

@@ -3,6 +3,7 @@
include unix
** bytecode
Contributor

Perhaps you need a separate ocamlrunparam for bytecode too? @shindere.

Contributor Author

Locally, the test case passes 20 times in a row.

Contributor

Welcome to multicore hacking :-) The GitHub Actions VMs are quite good at smoking out non-determinism bugs. I still don't know whether the ocamlrunparam as written applies only to the native runs or to both.

@xavierleroy
Contributor

I was thinking about these pesky programs that change the minor heap size programmatically. Maybe they are wrong (the runtime system knows best what a good size is) and should be told to stop doing that, or their requests to change the size should be ignored. But assuming we want to keep supporting this idiom, here is one possible way to go about it.

All minor heaps should reside in a range of memory addresses that contain no other data, so that the "is young" test can be implemented as two pointer comparisons. Let's call this range of addresses "the minor area".

Currently, "the minor area" is structured as N blocks of size S, where N is the max number of domains and S the max size for minor heaps, as fixed once and for all (before this PR) or determined when the runtime system starts (this PR). But many other arrangements can be considered, including dynamic allocation within the minor area...

The one parameter that must be determined when the runtime system starts is the total size of the minor area, so that an address range of this size can be reserved. After this, we can have a trivial dynamic allocator that implements malloc and free requests. These caml_minor_area_alloc and caml_minor_area_free functions can then be used to allocate minor heaps when domains are created, and to resize the minor heap for the calling domain exactly like OCaml 4 does it today (do minor GC, free minor heap, allocate new minor heap).

Additional benefit: with a given total minor area size, we can have one or a few domains with a huge minor heap, or a boatload of domains with small minor heaps... No need to decide in advance.
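The proposed caml_minor_area_alloc / caml_minor_area_free pair could be as simple as a first-fit allocator over the reserved range. A toy sketch follows; the function names come from the proposal above, but the slot bookkeeping, and taking a caller-supplied base instead of an mmap'd reservation, are illustrative assumptions (a real version would also coalesce free slots and respect alignment):

```c
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS 64

/* A slot describes one contiguous piece of the minor area. */
typedef struct { size_t off, len; int used; } slot;

static char *minor_area;        /* base of the reserved range */
static slot slots[MAX_SLOTS];   /* slots, sorted by offset */
static int nslots;

void caml_minor_area_init(char *base, size_t size)
{
  minor_area = base;
  slots[0].off = 0; slots[0].len = size; slots[0].used = 0;
  nslots = 1;
}

void *caml_minor_area_alloc(size_t size)
{
  for (int i = 0; i < nslots; i++) {
    if (!slots[i].used && slots[i].len >= size) {
      if (slots[i].len > size && nslots < MAX_SLOTS) {
        /* split the slot: the remainder stays free */
        memmove(&slots[i+1], &slots[i], (nslots - i) * sizeof(slot));
        nslots++;
        slots[i+1].off = slots[i].off + size;
        slots[i+1].len = slots[i].len - size;
        slots[i+1].used = 0;
        slots[i].len = size;
      }
      slots[i].used = 1;
      return minor_area + slots[i].off;
    }
  }
  return NULL; /* no room: domain creation or resize must fail */
}

void caml_minor_area_free(void *p)
{
  size_t off = (size_t)((char *)p - minor_area);
  for (int i = 0; i < nslots; i++)
    if (slots[i].off == off) { slots[i].used = 0; return; }
}
```

With this scheme, domain creation calls caml_minor_area_alloc, and an OCaml-4-style resize is free followed by alloc after a minor GC.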

@gasche
Member

gasche commented Jan 28, 2022

Additional benefit: with a given total minor area size, we can have one or a few domains with a huge minor heap, or a boatload of domains with small minor heaps... No need to decide in advance.

I understand that @sadiqj has experimented with various allocation strategies for the minor heaps that would be more flexible, but it was found to perform noticeably worse than the current approach, for unclear reasons (possibly NUMA effects). See the post-mortem at https://github.com/ocaml-multicore/ocaml-multicore/wiki/Domain-Local-Allocation-Buffers-Addendum

@xavierleroy
Contributor

xavierleroy commented Jan 28, 2022

I understand that @sadiqj has experimented with various allocation strategies for the minor heaps that would be more flexible

I forgot about that, thanks for the pointer! However, his scheme is more ambitious than mine. With his approach, chunks of the minor area could be acquired by a domain, then released at the next minor GC, then reacquired by another domain. This is a very clever way to address the case of domains with drastically different allocation rates. But I'm not surprised that NUMA systems react badly to chunks of memory being "moved" from one core to another this way.

My proposal is much much more modest: every domain still owns a fixed subset of the minor area; it's just that this subset is determined dynamically at domain creation time rather than being partitioned a priori. Maybe that's still well suited to NUMA ?

@sadiqj
Contributor

sadiqj commented Jan 28, 2022

My proposal is much much more modest: every domain still owns a fixed subset of the minor area; it's just that this subset is determined dynamically at domain creation time rather than being partitioned a priori. Maybe that's still well suited to NUMA ?

If I understand this correctly, there may still be an issue after a minor heap resize: a domain can end up with some memory that belonged to a different domain before the resize and may be on a different core/socket.

One option to avoid this (if we expect resizing minor heaps to be an infrequent operation) might be to munmap and mmap the minor heap?

I also wasn't 100% sold that the problems in the DLABs work @gasche mentioned were all NUMA issues; there seemed to be some weird cache effects going on that we didn't understand.

@xavierleroy
Contributor

If I understand this correctly there may still be an issue after minor heap resizing that a domain ends up with some memory that belonged to a different domain prior to resize and may be on a different core/socket.

I would expect minor heap resizing to occur very rarely (like, once at the beginning of the program), so this may not be much of a concern.

On the other hand, unmapping then remapping could be a good opportunity to change mapping flags and e.g. force the memory to be allocated for good? (I'm probably talking nonsense here).

On the third hand, I'm not sure that unmapping then remapping always succeeds...

free_domain_ml_values(p.ml_values);
caml_failwith("failed to allocate domain");
caml_failwith("failed to allocate domain. "
"You can set the 'Max_domains' OCAMLRUNPARAM parameter "

@kayceesrk
Contributor

caml_minor_area_alloc and caml_minor_area_free functions can then be used to allocate minor heaps when domains are created, and to resize the minor heap for the calling domain exactly like OCaml 4 does it today (do minor GC, free minor heap, allocate new minor heap).

One concern I have with such a dynamic scheme is that we may observe non-deterministic domain creation failures. For example, consider for the sake of argument that we create a 6 MB minor heap area to be shared among all of the domains. Let us assume that each domain starts with a default 2 MB minor heap. Consider the following program:

let main () =
  let d1 = Domain.spawn (fun _ -> 
    Gc.set { (Gc.get()) with Gc.minor_heap_size = 4194304 (* 4M *) })
  in
  let d2 = Domain.spawn (fun _ -> ()) in
  Domain.join d1;
  Domain.join d2

The spawn of d2 may fail non-deterministically depending on when d1 gets to resize the minor heap. If d2 is spawned before d1 gets to resize, then the resize request will fail. I suspect that it may end up being difficult to debug the program if many domains enter and leave.

This PR proposes a simpler API: you can have at most Max_domains domains running simultaneously, and if domain creation fails, the error message directs the user to this option. We may consider doing something similar for resize requests. Currently, it is a fatal error if a resize request fails, but we could direct the user to the maximum minor heap size OCAMLRUNPARAM option.

It is a bit unsatisfying that Coq will fatal error if the default Minor_heap_max is smaller than 256 MB in this PR. Coq will have to be run with OCAMLRUNPARAM="N=32M". Perhaps option number 2

Silently bump N to be at least as much as s. Choose the default values such as N >= s.

is not so bad with N=32M and s=256k.


Overall, I feel that this PR proposes useful improvements:

  • No fixed limit for minor heap size and maximum domains
  • Valgrind can work

I am keen to get to a state where the PR is forwards compatible with the experiments that we would like to do on alternate minor heap designs, and provides a simple API to the users.

@lpw25
Contributor

lpw25 commented Jan 29, 2022

The one parameter that must be determined when the runtime system starts is the total size of the minor area, so that an address range of this size can be reserved.

Somewhat naive question: why does the address range need to be reserved once and for all at the beginning? I would have thought that a new larger minor area could be mapped dynamically as the required size grows, doing a minor collection when it is, since there aren't any pointers into the minor area that survive a minor collection.

@kayceesrk
Contributor

Somewhat naive question: why does the address range need to be reserved once and for all at the beginning? I would have thought that a new larger minor area could be mapped dynamically as the required size grows, doing a minor collection when it is, since there aren't any pointers into the minor area that survive a minor collection.

I agree. Resize is a rare operation, and its performance is not a particular concern. On a resize request, we may perform a minor collection, and then in a stop-the-world section munmap the previous minor heap area, and reserve a larger area. Then the first thing each domain does is to commit the memory for its minor heap, and proceed with mutator execution. This will ensure the same locality benefits that we have currently.

Potentially, the same idea can work for the case when there are no free domain slots during a domain spawn.

@sadiqj
Contributor

sadiqj commented Jan 29, 2022

On a resize request, we may perform a minor collection, and then in a stop-the-world section munmap the previous minor heap area, and reserve a larger area. Then the first thing each domain does is to commit the memory for its minor heap, and proceed with mutator execution. This will ensure the same locality benefits that we have currently.

Agreed. This was my suggestion earlier in the thread: we don't necessarily need to keep the same minor heap range, as long as the switch is done during the STW section after the minor heap has been collected.

@xavierleroy
Contributor

It is a bit unsatisfying that Coq will fatal error if the default Minor_heap_max is smaller than 256 MB in this PR. Coq will have to be run with OCAMLRUNPARAM="N=32M"

This is not acceptable. Tweaking OCAMLRUNPARAM is OK for developers but not for end-users. (Plus, Windows users don't even know how to set an environment variable.) Standalone applications such as Coq or CompCert must be able to set their run-time parameters (minor heap size, max number of domains, etc) programmatically, at the beginning of execution. It's OK if it can only be done before the first domain is spawned.

Several ways to do this have been proposed in this PR. Why don't we agree first on functionality, then on a design, and only then submit and review PRs ?

@sabine
Contributor Author

sabine commented Jan 31, 2022

Clearly, it is unacceptable to break existing programs by requiring an environment variable to be set for them.

What if we keep the default Minor_heap_max at the original value of 2GB, so that existing programs are not affected? Then developers who need to run Valgrind or AFL can use OCAMLRUNPARAM to lower this limit.

It has become clear during the discussion that, moving forward, there is potential in exploring a better solution for giving the programmer control over the virtual memory reservations performed by the runtime.

I'm closing this in favor of continuing the discussion on #10971.

@sabine sabine closed this Jan 31, 2022
@shindere
Contributor

shindere commented Jan 31, 2022 via email

@kayceesrk
Contributor

Thanks for the clarification @shindere.

sabine added a commit to sabine/ocaml that referenced this pull request Feb 9, 2022
sabine added a commit to sabine/ocaml that referenced this pull request Feb 9, 2022
@ctk21 ctk21 mentioned this pull request Feb 14, 2022
10 participants