OCaml multicore memory model and C (runtime, FFI, VM) #10992
Description
Even simple OCaml programs depend on bits written in C, via the runtime or the FFI, and all bytecode OCaml programs are run by an interpreter written in C. C11 has so-called “catch-fire” semantics for data races, which means that the program can (even in practice) crash or do worse things in the presence of a data race. The OCaml multicore memory model (proposed in the PLDI'18 paper) on the other hand aims to guarantee that even racy programs have well-defined behaviour.
- To prevent the C code from threatening OCaml's local DRF property, the C code interacting with OCaml should implement the OCaml memory model.
The backwards-compatibility guarantees of multicore motivate (ideally) retrofitting this property onto existing programs, including preserving their performance, which might prove challenging. If the local DRF property is (reasonably) not required of all C code in existence (which initially is going to be legacy code inherited from OCaml 4), then we can weaken the requirement as follows:
- To prevent the catch-fire semantics of C11 from threatening OCaml's memory safety guarantees, the C code interacting with OCaml should take the OCaml memory model and racy OCaml programs into account.
In addition, the backwards-compatibility requirement is strict for single-threaded programs, but this leaves some leeway to ask programmers that they audit C code for memory-model compliance before turning on multiple domains. In this case, there should be a realistic path to making existing code compliant. It is not clear yet that such a relaxation of backwards-compatibility claims is necessary, but it is an option.
(Disclaimer: discussions about memory models usually fly over my head, but I made the effort of finding good references for what is below, and I asked for expert advice that confirmed the diagnosis. All errors in the document below are mine.)
(I believe that this issue can interest @stedolan, @kayceesrk, @maranget.)
The Field macro
In this issue I will focus on the Field macro from caml/mlvalues.h which is used to read and initialize blocks in the OCaml heap, and is part of the documented FFI. It is currently implemented as a plain load/store:
```c
#define Field(x, i) (((value *)(x)) [i]) /* Also an l-value. */
```

This can notably be subject to load/store tearing (even on aligned word-sized pointers) and invented loads/stores (see Who's afraid of a big bad optimizing compiler?).
Example
As an example, the following is a well-defined racy OCaml program:
```ocaml
let t1 x = x := Some (1,1) (* thread 1 *)
let t2 x = match !x with None -> () | Some (n,m) -> Printf.printf "%d:%d" n m (* thread 2 *)
let _ =
  let x = ref None in
  {{ t1 x || t2 x }}
```

Threads 1 and 2 can also be implemented in C, in code that looks like what one can imagine seeing in real-world programs:
```c
value t1(value x)
{
  CAMLparam1(x);
  CAMLlocal2(y,z);
  y = caml_alloc_small(1, Tag_some); // plain stores
  z = caml_alloc_small(2, 0);        // plain stores
  Field(z, 0) = Val_int(0);          // plain store
  Field(z, 1) = Val_int(0);          // plain store
  Field(y, 0) = z;                   // plain store
  Store_field(x, 0, y);              // release atomic store via caml_modify
  CAMLreturn(Val_unit);
}

value t2(value x)
{
  value y = Field(x, 0);             // plain read
  if (Is_some(y)) {
    value z = Field(y, 0);           // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}
```

According to the C memory model this is UB because 1) Field does a plain read on a value with a conflicting concurrent access, and 2) even if Field did a relaxed atomic read, there is no guarantee that t2 sees the initializing writes of y: it might be looking at uninitialized memory and dereferencing garbage. The code does not do anything fancy: any implementation of a simple pattern-matching can be subject to these issues. (The variable y would be declared with CAMLlocal1 if the documentation was followed closely, but this does not fundamentally change the problem, and the simplification is representative of what to expect of expert code.)
For example, for 1) the compiler could rewrite t2 by inventing reads as:
```c
if (Is_some(Field(x, 0))) {
  z = Field(Field(x, 0), 0);
  printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1)));
}
```

which could happen in practice if this code is placed in a context of high register pressure. Now the second read is not guaranteed to give the same value as the first one in a racy OCaml program, so one might be dereferencing whatever. To avoid this, the read should at least be a relaxed atomic read.
For 2), the current implementation of the OCaml memory model (adapted to the new STW collector design) intends to rely on dependency ordering being enforced in hardware (e.g. address dependencies on Arm) and on the compiler not doing anything crazy that breaks dependencies.
Dependency ordering in C
- In theory the C version is going to be UB if reads are not at least `memory_order_consume`.
- In practice it falls within a de facto use of the `relaxed` ordering, see http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf §2.6.7 (see also §A):

  “[if] the developers are willing to live within strict coding standards [12], `memory_order_relaxed` may be used to head dependency chains. […] Note well that this design pattern is outside of the current standard.”
Some background for people not familiar with `memory_order_consume`: it is an ordering for atomic reads that is stronger than `relaxed` but much cheaper than `acquire` on relaxed memory models. It is meant to rely on chips' ordering of data dependencies, inspired by the Linux kernel's RCU synchronisation mechanism, which is very cheap for the reader. In practice, though, it was found impossible to implement as intended and is compiled like an acquire load by all compilers. Only DEC Alpha did not respect data dependencies, due to cache incoherence; in theory it is also threatened by any form of value prediction, more below. This is why in practice “memory_order_relaxed may be used to head dependency chains”.
The Linux kernel is a good source of information regarding the use of data dependencies in practice in C. They avoid consume atomics at some cost:

- they place trust in the C compilers that Linux is supposed to be compiled with (gcc, clang, icc?) to do nothing crazy;
- they use `volatile` reads, expected to be treated similarly to relaxed atomic reads on word-sized values by the trusted compilers (AFAIU they could also use relaxed atomics, but use `volatile` for a mix of code legacy, backwards-compatibility and inadequacy of the C11 memory model);
- they expect programmers to follow a strict coding discipline, cf. the link above.

In practice, for OCaml+C code, not all of these rules are necessary, and the rest are difficult to break (essentially: do not do anything that would let the compiler break dependencies by speculating on the value of pointers).
At least, Linux shows precedent about the need to go off-road regarding the strict confines of C11 well-defined behaviour. (The situation is eloquently described here.)
Two imperfect solutions
Therefore for the Field macro I see two (imperfect) options:
- Make `Field` a `volatile` cast (`((volatile value *)(x)) [i]`):
  - the OCaml runtime & FFI do not follow the C11 memory model;
  - volatile reads are forever assumed to mimic relaxed loads/stores on word-sized aligned pointers, and to head dependency chains: no exotic/futuristic ISAs even for bytecode (or hope that a compiler switch makes it well-defined in the future, like `-fwrapv` regularizes OCaml's assumptions on signed overflows);
  - preserves performance (hopefully);
  - clarify rules similar to the RCU “coding standards” linked above (based on the relevant subset of it): a careful read is reassuring that only things out of the ordinary can have bad consequences, but it is still important to document;
  - not sufficient for legacy code to automatically enforce the OCaml memory model?
  - backwards-compatible?
- Make `Field` an `_Atomic` cast (`((_Atomic value *)(x)) [i]`):
  - well-defined behaviour: reads/writes using `Field` become SC loads/stores;
  - definite performance impact: the idea is that code does not break, but users are encouraged to progressively migrate towards newly-introduced, more efficient primitives;
  - in order to stay within the C11 memory model, a new primitive to read a field would still require at least `consume` ordering (that is, an `acquire` in practice, with a permanent impact on performance), with some optimisations possible if the field is immediate;
  - sufficient for legacy code to automatically enforce the OCaml memory model?
  - backwards-compatible in a different manner?
Another option is to not change anything and inherit the “catch-fire” semantics of C11 whenever a race involves the runtime or FFI code. But since the volatile solution already steps outside C11's well-defined behaviour anyway, it is unclear that doing nothing has any advantage over making Field a volatile cast.
Remaining questions
- Decide a solution to fix the `Field` macro and similar macros in the same file, as described above.
- Is any undocumented rule used by realistic, professional code with OCaml 4 but invalid for the OCaml memory model? (For instance, the use of `Field` to store an immediate when it is known that what is overwritten is an immediate?)
- Are there other parts of the documented FFI that require adaptation to make it realistic to conform to the OCaml memory model? Is the documentation up-to-date in this respect?
- The `Field` macro is found in the C parts of the stdlib. Assuming `Field` is fixed, do the latter conform to the OCaml memory model?
- The `Field` macro is found in the interpreter. Assuming `Field` is fixed, does the latter implement the OCaml memory model?