OCaml multicore memory model and C (runtime, FFI, VM) #10992
Description
Even simple OCaml programs depend on bits written in C, via the runtime or the FFI, and all bytecode OCaml programs are run by an interpreter written in C. C11 has so-called “catch-fire” semantics for data races, which means that the program can (even in practice) crash or do worse things in the presence of a data race. The OCaml multicore memory model (proposed in the PLDI'18 paper) on the other hand aims to guarantee that even racy programs have well-defined behaviour.
- To prevent the C code from threatening OCaml's local DRF property, the C code interacting with OCaml should implement the OCaml memory model.
The backwards-compatibility guarantees of multicore motivate (ideally) retrofitting this property onto existing programs, including preserving their performance, which might prove challenging. If the local DRF property is (reasonably) not required of all C code in existence (which initially is going to be legacy code inherited from OCaml 4), then we can weaken the requirement as follows:
- To prevent the catch-fire semantics of C11 from threatening OCaml's memory safety guarantees, the C code interacting with OCaml should take the OCaml memory model and racy OCaml programs into account.
In addition, the backwards-compatibility requirement is strict for single-threaded programs, but this leaves some leeway to ask programmers that they audit C code for memory-model compliance before turning on multiple domains. In this case, there should be a realistic path to making existing code compliant. It is not clear yet that such a relaxation of backwards-compatibility claims is necessary, but it is an option.
(Disclaimer: discussions about memory models usually fly over my head, but I made the effort of finding good references for what is below, and I asked for expert advice that confirmed the diagnosis. All errors in the document below are mine.)
(I believe that this issue can interest @stedolan, @kayceesrk, @maranget.)
The Field macro
In this issue I will focus on the Field macro from caml/mlvalues.h which is used to read and initialize blocks in the OCaml heap, and is part of the documented FFI. It is currently implemented as a plain load/store:
```c
#define Field(x, i) (((value *)(x)) [i]) /* Also an l-value. */
```

This can notably be subject to load/store tearing (even on aligned word-sized pointers) and invented loads/stores (see Who's afraid of a big bad optimizing compiler?).
Example
As an example, the following is a well-defined racy OCaml program:
```ocaml
let t1 x = x := Some (1,1) (* thread 1 *)
let t2 x = match !x with None -> () | Some (n,m) -> Printf.printf "%d:%d" n m (* thread 2 *)
let _ =
  let x = ref None in
  {{ t1 x || t2 x }}
```

Threads 1 and 2 can also be implemented in C, in code that looks like what one can imagine seeing in real-world programs:
```c
value t1(value x)
{
  CAMLparam1(x);
  CAMLlocal2(y,z);
  y = caml_alloc_small(1, Tag_some); // plain stores
  z = caml_alloc_small(2, 0);        // plain stores
  Field(z, 0) = Val_int(0);          // plain store
  Field(z, 1) = Val_int(0);          // plain store
  Field(y, 0) = z;                   // plain store
  Store_field(x, 0, y);              // release atomic store via caml_modify
  CAMLreturn(Val_unit);
}

value t2(value x)
{
  value y = Field(x, 0);             // plain read
  if (Is_some(y)) {
    value z = Field(y, 0);           // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}
```

According to the C memory model this is UB because 1) Field does a plain read on a value with a conflicting concurrent access, and 2) even if Field did a relaxed atomic read, there is no guarantee that t2 sees the initializing writes of y: it might be looking at uninitialized memory and dereferencing garbage. The code does not do anything fancy: any implementation of a simple pattern-matching can be subject to these issues. (The variable y would be declared with CAMLlocal1 if the documentation was followed closely, but this does not fundamentally change the problem, and the simplification is representative of what to expect of expert code.)
For example, for 1) the compiler could rewrite t2 by inventing reads as:
```c
if (Is_some(Field(x, 0))) {
  z = Field(Field(x, 0), 0);
  printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1)));
}
```

which could happen in practice if this code is placed in a context of high register pressure. Now the second read is not guaranteed to give the same value as the first one in a racy OCaml program, so one might be dereferencing whatever. To avoid this, the read should at least be a relaxed atomic read.
For 2), the current implementation of the OCaml memory model (adapted to the new STW collector design) intends to rely on dependency ordering being enforced in hardware (e.g. address dependencies on Arm) and on the compiler not doing anything crazy that breaks dependencies.
Dependency ordering in C
- In theory the C version is going to be UB if reads are not at least `memory_order_consume`.
- In practice it falls within a de facto use of the `relaxed` ordering, see http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf §2.6.7 (see also §A):

  “[if] the developers are willing to live within strict coding standards [12], `memory_order_relaxed` may be used to head dependency chains. […] Note well that this design pattern is outside of the current standard.”
Some background for people not familiar with `memory_order_consume`: it is an ordering for atomic reads that is stronger than `relaxed` but much cheaper than `acquire` on relaxed memory models. It is meant to rely on chips' ordering of data dependencies, inspired by the Linux kernel's RCU synchronisation mechanism, which is very cheap for the reader. In practice, though, it was found impossible to implement as intended and is compiled like an acquire load by all compilers. Only DEC Alpha did not respect data dependencies, due to cache incoherence; in theory it is also threatened by any form of value prediction, more below. This is why in practice “memory_order_relaxed may be used to head dependency chains”.
The Linux kernel is a good source of information regarding the use of data dependencies in practice in C. They avoid consume atomics at some cost:

- they place trust in the C compilers that Linux is supposed to be compiled with (gcc, clang, icc?) to do nothing crazy;
- they use `volatile` reads, expected to be treated similarly to relaxed atomic reads on word-sized values by the trusted compilers (AFAIU they could also use relaxed atomics, but use `volatile` for a mix of code legacy, backwards-compatibility and inadequacy of the C11 memory model);
- they expect programmers to follow a strict coding discipline, cf. the link above.

In practice, for OCaml+C code, not all of these rules are necessary, and the rest are difficult to break (essentially: do not do anything that would let the compiler break dependencies by speculating on the value of pointers).
At least, Linux shows precedent about the need to go off-road regarding the strict confines of C11 well-defined behaviour. (The situation is eloquently described here.)
Two imperfect solutions
Therefore for the Field macro I see two (imperfect) options:
- Make `Field` a `volatile` cast (`((volatile value *)(x)) [i]`):
  - the OCaml runtime & FFI do not follow the C11 memory model;
  - volatile reads are forever assumed to mimic relaxed loads/stores on word-sized aligned pointers, and to head dependency chains: no exotic/futuristic ISAs even for bytecode (or hope that a compiler switch makes it well-defined in the future, like `-fwrapv` regularizes OCaml's assumptions on signed overflows);
  - preserves performance (hopefully);
  - clarify rules similar to the RCU “coding standards” linked above (based on the relevant subset of it): a careful read is reassuring that only things out of the ordinary can have bad consequences, but it is still important to document;
  - not sufficient for legacy code to automatically enforce the OCaml memory model?
  - backwards-compatible?
- Make `Field` an `_Atomic` cast (`((_Atomic value *)(x)) [i]`):
  - well-defined behaviour: reads/writes using `Field` become SC loads/stores;
  - definite performance impact: the idea is that code does not break, but users are encouraged to progressively migrate towards newly-introduced, more efficient primitives;
  - in order to stay within the C11 memory model, a new primitive to read a field would still require at least `consume` ordering (that is, an `acquire` in practice, with a permanent impact on performance), with some optimisations possible if the field is immediate;
  - sufficient for legacy code to automatically enforce the OCaml memory model?
  - backwards-compatible in a different manner?
Another option is to not change anything and inherit the “catch-fire” semantics of C11 whenever a race involves the runtime or FFI code. But since the volatile solution already steps outside C11's well-defined behaviour anyway, it is unclear that doing nothing has any advantage over making Field a volatile cast.
Remaining questions
- Decide a solution to fix the `Field` macro and similar macros in the same file, as described above.
- Is any undocumented rule used by realistic, professional code with OCaml 4 but invalid for the OCaml memory model? (For instance, the use of `Field` to store an immediate when it is known that what is overwritten is an immediate?)
- Are there other parts of the documented FFI that require adaptation to make it realistic to conform to the OCaml memory model? Is the documentation up-to-date in this respect?
- The `Field` macro is found in the C parts of the stdlib. Assuming `Field` is fixed, do the latter conform to the OCaml memory model?
- The `Field` macro is found in the interpreter. Assuming `Field` is fixed, does the latter implement the OCaml memory model?