OCaml multicore memory model and C (runtime, FFI, VM) #10992

@gadmm

Description

Even simple OCaml programs depend on bits written in C, via the runtime or the FFI, and all bytecode OCaml programs are run by an interpreter written in C. C11 has so-called “catch-fire” semantics for data races, which means that the program can (even in practice) crash or do worse things in the presence of a data race. The OCaml multicore memory model (proposed in the PLDI'18 paper) on the other hand aims to guarantee that even racy programs have well-defined behaviour.

  • To prevent the C code from threatening OCaml's local DRF property, the C code interacting with OCaml should implement the OCaml memory model.

The backwards-compatibility guarantees of multicore motivate (ideally) retrofitting this property onto existing programs, including preserving their performance, which might prove challenging. If the local DRF property is (reasonably) not required of all C code in existence (which initially is going to be legacy code inherited from OCaml 4), then we can weaken the requirement as follows:

  • To prevent the catch-fire semantics of C11 from threatening OCaml's memory safety guarantees, the C code interacting with OCaml should take the OCaml memory model and racy OCaml programs into account.

In addition, the backwards-compatibility requirement is strict for single-threaded programs, but this leaves some leeway to ask programmers that they audit C code for memory-model compliance before turning on multiple domains. In this case, there should be a realistic path to making existing code compliant. It is not clear yet that such a relaxation of backwards-compatibility claims is necessary, but it is an option.

(Disclaimer: discussions about memory models usually fly over my head, but I made the effort of finding good references for what is below, and I asked experts for advice, who confirmed the diagnosis. All errors in the document below are mine.)

(I believe that this issue can interest @stedolan, @kayceesrk, @maranget.)


The Field macro

In this issue I will focus on the Field macro from caml/mlvalues.h which is used to read and initialize blocks in the OCaml heap, and is part of the documented FFI. It is currently implemented as a plain load/store:

#define Field(x, i) (((value *)(x)) [i])           /* Also an l-value. */

This can notably be subject to load/store tearing (even on aligned word-sized pointers) and to invented loads/stores (see Who's afraid of a big bad optimizing compiler?).

Example

As an example, the following is a well-defined racy OCaml program:

let t1 x = x := Some (1,1) (* thread 1 *)
let t2 x = match !x with None -> () | Some (n,m) -> Printf.printf "%d:%d" n m (* thread 2 *)
let _ =
  let x = ref None in
  {{ t1 x || t2 x }}

Thread 1 and 2 can also be implemented in C which looks like code that one can imagine seeing in real-world programs:

value t1(value x)
{
  CAMLparam1(x);
  CAMLlocal2(y,z);
  y = caml_alloc_small(1, Tag_some); // plain stores
  z = caml_alloc_small(2, 0); // plain stores
  Field(z, 0) = Val_int(0); // plain store
  Field(z, 1) = Val_int(0); // plain store
  Field(y, 0) = z; // plain store
  Store_field(x, 0, y); // release atomic store via caml_modify
  CAMLreturn(Val_unit);
}

value t2(value x)
{
  value y = Field(x, 0); // plain read
  if (Is_some(y)) {
    value z = Field(y, 0); // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}

According to the C memory model this is UB because 1) Field does a plain read on a value with a conflicting concurrent access, and 2) even if Field did a relaxed atomic read, there is no guarantee that t2 sees the initializing writes of y: it might be looking at uninitialized memory and dereferencing garbage. The code does not do anything fancy: any implementation of a simple pattern-matching can be subject to these issues. (The variable y would be declared with CAMLlocal1 if the documentation were followed closely, but this does not fundamentally change the problem, and the simplification is representative of what to expect from expert code.)

For example, for 1) the compiler could rewrite t2 by inventing reads as:

  if (Is_some(Field(x, 0))) {
    z = Field(Field(x, 0), 0);
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1)));
  }

which could happen in practice if this code is placed in a context of high register pressure. Now the second read is not guaranteed to give the same value as the first one in a racy OCaml program, so one might be dereferencing whatever. To avoid this, the read should at least be a relaxed atomic read.

For 2), the current implementation of the OCaml memory model (adapted to the new STW collector design) intends to rely on dependency ordering being enforced in hardware (e.g. address dependencies on Arm) and on the compiler not doing anything crazy that breaks dependencies.

Dependency ordering in C

  • In theory the C version is going to be UB if reads are not at least memory_order_consume.
  • In practice it falls within a de facto use of the relaxed ordering, see http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf §2.6.7 (see also §A).

    “[if] the developers are willing to live within strict coding standards [12], memory_order_relaxed may be used to head dependency chains. […] Note well that this design pattern is outside of the current standard.”

Some background for people not familiar with memory_order_consume: it is an ordering for atomic reads stronger than relaxed but much cheaper than acquire on relaxed memory models, meant to rely on the hardware's ordering of data dependencies, inspired by the Linux kernel's RCU synchronisation mechanism, which is very cheap for the reader. In practice, though, it was found impossible to implement as intended and is compiled like an acquire load by all compilers. Only DEC Alpha did not respect data dependencies, due to cache incoherence; in theory dependency ordering is also threatened by any form of value prediction, more below. This is why in practice “memory_order_relaxed may be used to head dependency chains”.

The Linux kernel is a good source of info regarding the use of data dependency in practice in C. They avoid consume atomics at some cost:

  • place trust in the C compilers that Linux is supposed to be compiled with (gcc, clang, icc?) to do nothing crazy
  • use volatile reads, expected to be treated similarly to relaxed atomic reads on word-sized values on the trusted compilers (AFAIU they could also use relaxed atomics, but use volatile for a mix of code legacy, backwards-compatibility and inadequacy of the C11 memory model)
  • expect programmers to follow a strict coding discipline, cf. link above. In practice for OCaml+C code not all of these rules are necessary and the rest are difficult to break (essentially not doing anything that would let the compiler break dependencies by speculating on the value of pointers).

At least, Linux sets a precedent for the need to go off-road beyond the strict confines of C11 well-defined behaviour. (The situation is eloquently described here.)

Two imperfect solutions

Therefore for the Field macro I see two (imperfect) options:

  1. Make Field a volatile cast (((volatile value *)(x)) [i]):
    • the OCaml runtime & FFI do not follow the C11 memory model
    • volatile reads are forever assumed to mimic relaxed loads/stores on word-sized aligned pointers, and to head dependency chains: no exotic/futuristic ISAs even for bytecode (or hope that a compiler switch makes this well-defined in the future, like -fwrapv regularizes OCaml's assumptions on signed overflow).
    • preserves performance (hopefully)
    • clarify rules similar to the RCU “coding standards” linked above (based on the relevant subset of them). A careful read is reassuring in that only out-of-the-ordinary code can have bad consequences, but the rules are still important to document.
    • not sufficient for legacy code to automatically enforce the OCaml memory model?
    • backwards-compatible?
  2. Make Field an atomic cast (((_Atomic value *)(x)) [i]):
    • well-defined behaviour, read/writes using Field become SC load/stores
    • definite performance impact: the idea is that code does not break but users are encouraged to progressively migrate towards newly-introduced more efficient primitives
    • in order to stay within the C11 memory model, a new primitive to read a field would still require at least consume ordering (that is, an acquire in practice, with permanent impact on performance), with some optimisations possible if immediate
    • sufficient for legacy code to automatically enforce the OCaml memory model?
    • backwards-compatible in a different manner?

Another option is to change nothing and inherit the “catch-fire” semantics of C11 whenever there is a race in which one victim is the runtime or FFI code. But since this option already invokes UB, it is unclear that it has any advantage over making Field a volatile cast.


Remaining questions

  • Decide a solution to fix the Field macro and similar macros in the same file as described above.
  • Is any undocumented rule used by realistic, professional code with OCaml 4 but invalid for the OCaml memory model? (For instance, use of Field to store an immediate when it is known that what is overwritten is an immediate?)
  • Are there other parts of the documented FFI that require adaptation to make it realistic to conform to the OCaml memory model? Is the documentation up-to-date in this respect?
  • The Field macro is found in the C parts of the stdlib. Assuming Field is fixed, do the latter conform to the OCaml memory model?
  • The Field macro is found in the interpreter. Assuming Field is fixed, does the latter implement the OCaml memory model?
