
Explain mapping between OCaml memory model and C #10995

Open
stedolan wants to merge 1 commit into ocaml:trunk from stedolan:memory-model-note

Conversation

@stedolan
Contributor

@stedolan stedolan commented Feb 6, 2022

Evidently, the huge comment in memory.c isn't explaining much, and the acquire; release code sequence in caml_modify continues to mystify. Here's an attempt at a better explanation of the relationship between the OCaml memory model and C: if this one makes sense, I'll update the comment in memory.c. I hope this answers @gadmm's questions and sheds some light on the armv8 mappings.

OCaml and C memory models

Like C, the OCaml memory model distinguishes atomic and nonatomic memory accesses, so the natural approach is to map OCaml atomics to C SC atomics (memory_order_seq_cst, the default) and OCaml nonatomics to C relaxed atomics (memory_order_relaxed). (We cannot use C nonatomic accesses instead of relaxed here, because we want well-defined semantics even in the presence of races, which C nonatomics do not provide).

However, this simple mapping does not work, because OCaml's memory model is stronger than what this mapping to C provides in two particular ways, described below:

1. Stronger atomic happens-before

Both C and OCaml define happens-before, a partial order on events of a program which defines which orderings can be relied upon for synchronisation. For atomics, the relation is stronger in OCaml: with C (SC) atomics, happens-before only arises between a write and a read that reads from it. In particular, if two domains/threads both write to an atomic variable, in OCaml one of the writes definitely happens-before the other, which is not true in C, leading to some counterintuitive examples.

Example

Below, p is an int Atomic.t and q is an int ref, both initially 0:

(* domain 1 *)
q := 1;
Atomic.set p 1;
Printf.printf "p=%d\n" (Atomic.get p)

(* domain 2 *)
Atomic.set p 2;
Printf.printf "q=%d\n" !q

Is it possible for this to print p=2 and q=0? In OCaml, no. If it prints p=2, then Atomic.set p 2 must have occurred right between Atomic.set p 1 and Atomic.get p, which means that the q := 1 happens-before the set p 1 which happens-before the set p 2, which happens-before the !q, so we must see q=1.

In C, with SC atomics for Atomic and relaxed atomics for ref, it can print p=2 and q=0. The q := 1 happens-before set p 1, which is SC-before set p 2, which happens-before !q, but in C11 SC-before between two SC writes does not induce happens-before, so q := 1 does not happen-before !q and !q may therefore return 0.

The fix is to map an OCaml atomic store not to a C atomic store but to a C atomic exchange. Each atomic exchange reads from the previous one (even if it discards the answer), and this additional read is enough to induce happens-before in C, as the OCaml memory model requires.

2. Nonatomic reads are consistent with atomic operations

In C, the SC order (the total order on all SC atomic operations) is required to be consistent with happens-before, and the relaxed atomic reads-from relation (which specifies which write each read gets its value from) is also required to be consistent with happens-before. However, the SC order and the relaxed atomic reads-from relation are not required to be consistent with each other, which leads to weird effects where relaxed atomics reads can get their values from writes that are, by the SC order, in the future.

Example

Again, p is an int Atomic.t while q is an int ref.

(* domain 1 *)
Atomic.set p 1;
q := 1

(* domain 2 *)
let x = !q in
let y = Atomic.get p in
Printf.printf "x=%d y=%d\n" x y

Is it possible to see this print "x=1 y=0"? In OCaml, the answer is no. Consider which occurs earlier in the trace, the Atomic.set p 1 or the Atomic.get p. If the set is earlier, we will see y=1, while if the get is earlier then the !q cannot see the q := 1, so we will see x=0.

However, in C11 (with SC for Atomic.t and relaxed for ref), the outcome x=1 y=0 is possible. In C11 terms, the Atomic.set happens-before the q := 1, which is read-from by the !q, which happens-before the Atomic.get, which is SC-before the Atomic.set. This cycle of reads-from, SC-before and happens-before is permitted by the model, and actually occurs on e.g. ARM.

In OCaml, the nonatomic reads-from, happens-before and the order on atomics are all required to be consistent with each other. (It is this property, more than anything else, that makes the step-by-step operational semantics possible).

Achieving this is a bit tricky in C. The easiest way is to use release-acquire rather than relaxed atomics for OCaml nonatomics, as reads-from between release-acquire atomics forms part of happens-before in C, and must therefore be consistent with the SC order.


Mapping of operations to C:

OCaml operation    C operation
Atomic load        atomic_load_explicit(..., memory_order_seq_cst)
Atomic store       atomic_exchange_explicit(..., memory_order_seq_cst)
Nonatomic load     atomic_load_explicit(..., memory_order_acquire)
Nonatomic store    atomic_store_explicit(..., memory_order_release)

Optimising the mapping, C version

The mapping above is correct but extremely costly on non-x86 platforms. (This model was benchmarked in the PLDI paper under the name "SRA", with 3x slowdowns on some benchmarks).

Much of the cost is because nonatomic loads require a fence due to the use of memory_order_acquire. However, this is much stronger than we need: OCaml places no ordering constraints between subsequent nonatomic loads, but memory_order_acquire on each load forces them all to go in order.

C allows explicit fences to replace memory ordering on operations, allowing one to turn:

val = atomic_load_explicit(..., memory_order_acquire);

into:

val = atomic_load_explicit(..., memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);

Since we are only using acquire semantics to maintain consistency of nonatomic reads-from and atomic ordering, we only need this fence sometime before the next event that can be read-from or is atomic, that is, before the next nonatomic write or atomic operation. In fact, we can always delay the fence until the next such operation, giving us the optimised mapping below:


Optimised mapping of operations to C:

OCaml operation    C operation
Atomic load        atomic_thread_fence(memory_order_acquire);
                   atomic_load_explicit(..., memory_order_seq_cst);
Atomic store       atomic_thread_fence(memory_order_acquire);
                   atomic_exchange_explicit(..., memory_order_seq_cst);
Nonatomic load     atomic_load_explicit(..., memory_order_relaxed);
Nonatomic store    atomic_thread_fence(memory_order_acquire);
                   atomic_store_explicit(..., memory_order_release);

Optimising the mapping, ARMv8 version

I think the above is about as far as we can go while remaining strictly within the letter of the C memory model. However, when compiling to ARMv8, better instruction sequences are possible than C compilers will produce with the above mapping.

The instruction sequences produced by the above on C compilers I've tried are:

OCaml operation    ARMv8 operation
Atomic load        dmb ishld; ldar
Atomic store       dmb ishld; L: ldaxr; stlxr; cbnz L
Nonatomic load     ldr
Nonatomic store    dmb ishld; stlr

First, the dmb ishld in atomic stores does nothing: dmb ishld enforces ordering with respect to prior loads, which is subsumed by stlxr, which enforces ordering with respect to prior everything. (The C compiler should really be able to elide this one, I think).

Second, the dmb ishld; stlr sequence is overly conservative. We need to ensure that nonatomic stores are ordered after prior atomic operations and nonatomic loads. (More precisely, we need happens-before between operations that synchronise with a prior atomic operation / write to a prior atomic load, and operations that read from this nonatomic store). Strictly, in C, the only way to do this is for this to be a release store (hence stlr), but the ARMv8 model is more forgiving and provides many ways to ensure this ordering.

One of them is to use dmb ishld and dmb ishst and a plain str store. dmb ishld provides ordering with respect to prior reads (covering both nonatomic loads and atomic ones), while dmb ishst provides ordering with respect to prior writes. Since we only need the ordering with respect to prior atomic writes, we can place the dmb ishst after each atomic write rather than before nonatomic writes, yielding the optimised compilation model from the PLDI paper:

OCaml operation    ARMv8 operation
Atomic load        dmb ishld; ldar
Atomic store       L: ldaxr; stlxr; cbnz L; dmb st
Nonatomic load     ldr
Nonatomic store    dmb ishld; str

This is the model adopted in #10972.

There is one additional wrinkle, though, not covered by the discussion here or the PLDI paper: when there is a data race involving a newly-allocated object, then we must guarantee that all reads that see that object at least see the initialisation of that object. This can be done by upgrading the nonatomic store to an stlr or a dmb ishst; str when it might make a newly-allocated object visible, and relying on some slightly tricky reasoning about address dependencies on the reading end. This issue is being discussed separately in #10992.


Proposed code changes

  • Since Arm64 multicore support #10972 adopts the optimised ARMv8 model above, we must ensure that there is a suitable barrier after atomic stores. (The memory model requires ordering between atomic stores and subsequent nonatomic stores, and if we're using the optimised version of nonatomic writes, which only have dmb ishld, then we need to provide this ordering on atomic stores. This is the inconsistency @gadmm noticed on #10831 - the code that's currently there is correct in isolation but requires an extra fence if used with optimised ARM stores).

  • The barriers on caml_modify are only necessary to order prior initialisations, nonatomic loads, and atomic operations. In particular, in a sequence of caml_modify, only the first one needs barriers. Perhaps we should provide optimised versions of blit and fill, for use with arrays.

  • caml_modify is implemented with an acquire fence / release store. As described above, on armv8 the release store subsumes the acquire fence. While the fence is necessary according to a strict reading of the C model, I'm not aware of any architecture where it actually does anything in this situation. We should at least run benchmarks with it commented out, to see if it has a noticeable cost.

  • I'll update the massive block comment in memory.c with something hopefully more readable.

@gadmm
Contributor

gadmm commented Feb 6, 2022

Thanks for the write-up. I haven't had the brain surgery mentioned by Gabriel, and what you wrote here is very clear and casts your memory model in a very natural light. If the length and the formatting inhibit you from moving it directly to the C comment, you could also add it as a separate .md file (or RFC).

I am still looking for more clarification about this publication safety business since it looks important to me.

  • Should the “non-atomic reads” line here or in table 2 of the paper not be something like “ldr + preserve dependency”, to inform valid program transformations? This is what I see in memory model descriptions when there is a dependency ordering. I think this would still be an aspect of the memory model that one has to be aware of and have documented (for example for researchers who experiment with post-processing the OCaml compiler's output, "feedback-based optimisation" mentioned elsewhere, future contributors to flambda who might have crazy ideas, etc.).

  • I would be surprised if flambda does not already respect the dependency ordering; on the other hand implementing C11 consume ordering has proved difficult for compiler writers. Intuitively, I would say that OCaml's problem is simpler. But I am not an expert! Can you state the reason why we think that respecting the dependency ordering is going to be easy for compiler writers?

  • The PLDI paper was interesting notably because it proposed a fairly controlled experiment. But it was assuming the concurrent collector with a very different publication scheme, letting you use str instead of stlr. It is not clear that this release store can eventually be optimised most of the time (though various ideas can be tried). Do you have an idea of the performance impact of the release store?

  • One way of looking at the publication-safety claim is to consider the following would-be implementation of a Linux-RCU-cheap shared variable (by "Linux RCU" I refer to the very cheap implementation for readers, not the reclamation scheme since we can use the GC to manage memory):

    struct
      type 'a t = 'a ref
      let fence = Atomic.make ()
      let rcu_make x = ref x
      let rcu_dereference r = !r
      let rcu_update r x =
        (* proxy for a release fence, enforce store-store ordering *)
        Atomic.set fence () ; 
        r := x
    end :
    sig
      type 'a t
      val rcu_make : 'a -> 'a t
      val rcu_dereference : 'a t -> 'a
      val rcu_update : 'a t -> 'a -> unit
    end

    I believe that this makes a useful synchronisation mechanism in multicore OCaml, where the read is a simple ldr on Arm, like the rcu_dereference of the Linux kernel (which was the motivation for the consume ordering).
    Indeed, consider the following example:

    let x = ref 0
    let t1 r = (* thread 1 *)
      x := 1 ;
      rcu_update r x
    let t2 r = (* thread 2 *)
      print_int (rcu_dereference r)
    let _ = 
      let r = rcu_make (ref 1) in
      {{ t1 r || t2 r }}

    I believe that the publication safety claims amount to saying that the output is necessarily 1. However, it is not clear to me that the OCaml memory model from the paper does not allow the outcome 0. Does enriching the memory model with what is needed to prove publication safety lead to a stronger model, which would justify the use of this synchronisation mechanism?

@stedolan
Contributor Author

stedolan commented Feb 7, 2022

I believe that the publication safety claims amount to saying that the output is necessarily 1. However, it is not clear to me that the OCaml memory model from the paper does not allow the outcome 0.

Thanks, this is an excellent example. The output is not necessarily 1 here, and the OCaml memory model allows outcome 0.

I think you're working under the C-style release/consume model, which does in theory allow only the outcome 1. The fences we're inserting on arm64 are sufficient to provide this, but as you've noted correctly implementing this requires some complicated and poorly understood mechanism for preserving dependencies in the compiler. It is difficult to even specify what it means to preserve these dependencies. For instance, suppose t2 instead looked like this:

let t2 r =
  let n = rcu_dereference r in
  if r == x then Some n else None

An optimising compiler could easily choose to push the load under the if (noting that it is needed only in the true branch), and then replace a load of r with a load of x (since we're now inside the comparison r == x), yielding:

let t2 r =
  if r == x then Some !x else None

which does not preserve the dependency and makes the outcome 0 visible. This sort of issue is what makes the C consume ordering more or less unworkable in practice, causing compilers to interpret it as a synonym for acquire.


The idea of "publication safety" for OCaml values is much weaker than release/consume ordering. The idea is to distinguish initialising writes from other writes, and to guarantee that any access to an object must at least see its initialising writes. This is necessary for type safety: we want to ensure that even if a reference to an object is obtained via a data race, its contents are still valid. For instance, consider this racy access:

{{ let z = ref 42 in g := z    ||   print_int !(!g) }}

Here, the access to !g may or may not return z, but if it does then the contents of z will definitely be 42. Under no circumstances will you see uninitialised garbage as the contents of z.

Conversely, this guarantee is not made if the writes are not initialising:

{{ z := 42; g := z    ||    print_int !(!g) }}

Here, the two writes z := 42 and g := z are not ordered in any way, and the other thread may see !g return z and yet observe the old value of z. (In practice, this is unlikely to happen, but compiler optimisations may make it occur).

This weaker guarantee does not require tricky reasoning about dependency in the compiler. Values in OCaml are initialised as soon as they are allocated, and allocations yield fresh values initially private to a single domain. Objects are always fully initialised by a single domain before they ever become visible to a second domain, and the relevant address dependencies will always be present in the second domain because there is no other way to get the value. This is the feature that was missing in your example: there, the value x was already known to the second thread before the write occurred, allowing the second thread to speculate on its value.

You can imagine this becoming false with a compiler that speculated on the value of pointers it had never seen before, by e.g. comparing them to some constant it magics up. Happily, this sort of thing would be invalid in any case due to GC (which can change the concrete value of pointers), forcing the current and any future optimising compiler for OCaml to treat GC pointers as opaque tokens. I suppose a C compiler could in theory do this, which as mentioned in #10992 is probably a distant future worry.

(This stuff didn't really show up in the PLDI paper: there was some stuff about "initial writes", but since we didn't model allocation they're not very interesting)


Regarding performance, I agree it would be good to re-run some benchmarks now that the GC design has changed. I don't have a particularly strong feeling for how much the current caml_modify fence costs, but if it's high there are some ways to cheapen it:

  • we could drop to armv8 assembly and use explicit dmb ishst for publication safety, which may be cheaper than the release store.
  • since publication safety only requires fencing between initialisation and mutation, we could optimise the generated code of ocamlopt to skip the fence when there has definitely not been an allocation that could be published by a given mutation.

@ctk21 ctk21 mentioned this pull request Feb 8, 2022
@gadmm
Contributor

gadmm commented Feb 8, 2022

Thanks for the additional explanations. I was too focused on compiler transformations on the writing side but now I see that there is a problem with the reading side.

(Your example with t2 does not type-check but I see your point with something like if n == x then ...)

You confirm one aspect which is that initialization is meant at program-level rather than low-level allocator behaviour (e.g. like the C allocation functions which initialize everything with 1s). As you say, in the example with ref 42 one is guaranteed to see 42 and not some other memory-safe value, and I think this is already useful for programming. My intuition was that it is not possible to prove publication safety without proving a usefully stronger result, and you seem to confirm it.

Some side-condition is missing to my synchronisation mechanism. For instance, if you only use it with immutable values then there cannot be any problem. Now the PLDI'18 model will already prevent you from reasoning about such programs (IIUC the limitation of the model, it would tell you that you can get the imaginary value v₀).

I was wondering if a weaker side-condition could be something like “only call rcu_update with fresh values”. However, for this to work, freshness should be meant transitively. This might be of limited use in practice with mutable data. I think the rule is really “avoid operations after rcu_dereference that let the compiler speculate physically on its return value”.

I also agree that non-atomic loads are not “ldr + preserve dependencies” in the same sense as consume. I also get the intuitive point that everything a compiler reasonably does will preserve the dependencies needed for memory safety. But I still think that compiler writers are going to implement some unspecified invariant (consciously or not).

The last two points remind me a lot of the situation with the Linux RCU vs. the C11 memory model. However, compared to the former's non-standard use of volatile in C, OCaml offers stable and portable guarantees, because the behaviour is inherently needed for memory safety.


For the context, my questions are motivated by something that was puzzling me: is the effort on the memory model only for the sake of giving a nice meaning to “buggy” programs, or does it allow new useful behaviours? According to http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1444.htm which introduced memory_order_consume, leveraging dependency ordering "has resulted in order-of-magnitude speedups" in two scenarios (often-read-rarely-written concurrent data structures, pointer-mediated publication).

@ctk21
Contributor

ctk21 commented Feb 9, 2022

I'm confused. When I compile the current atomic store code (caml_atomic_exchange) with Apple clang version 13.0.0:

    atomic_thread_fence(memory_order_acquire);
    ret = atomic_exchange(Op_atomic_val(ref), v);

is becoming:

        dmb     ishld
        swpal   x21, x20, [x19]

I'm trying to reconcile that with this statement:

Since #10972 adopts the optimised ARMv8 model above, we must ensure that there is a suitable barrier after atomic stores

It looks to me that this C compiler is giving us an equivalent of:

dmb ishld; L: ldaxr; stlxr; cbnz L

which puts us in the unoptimised ARMv8 model and so no bug in caml_atomic_exchange.
I also see #10972 as not touching atomic stores since this is the job of caml_atomic_exchange.

What am I missing?

@gadmm
Contributor

gadmm commented Feb 9, 2022

Atomic and non-atomic stores go in pairs. You cannot have the optimised version of one without the optimised version of the other, and conversely.

@ctk21
Contributor

ctk21 commented Feb 9, 2022

Ah, I might be getting somewhere. At the time of writing, #10972 has for non-atomic stores that do not go through caml_modify:

if assignment then ` dmb ishld\n`;
` str {emit_reg src}, {emit_addressing addr base}\n`

The point of contention is that this should be:
dmb ishld; stlr

@kayceesrk
Contributor

kayceesrk commented Jun 21, 2024

The intention of the PR was to rewrite the note at memory.c. The note has since been expanded in #12473, and this PR text itself serves as good documentation. Hence, I shall close this PR.

As an aside, I keep coming back to this PR and other related ones often to remind myself of the memory model intricacies. Hence, I've added the tag memory-model to this and the related ones. Hope this will be useful for others.

@kayceesrk kayceesrk closed this Jun 21, 2024
@gadmm
Contributor

gadmm commented Jun 21, 2024

I don't think this is right. #12473 added the note [MMOC] whereas the present PR is about improving the note [MM] right above it. It would still be very nice to add what is in this PR description in the documentation somewhere.

I also made a point about relaxed atomic reads in relationship with clarifying publication-safety, which has not been addressed yet. I think that part of the discussion has continued here: ocaml/RFCs#42.

I am curious to know if efforts have been made to rerun benchmarks from the PLDI'18 paper with the added fences that are necessary for the new GC design.

@kayceesrk
Contributor

It would still be very nice to add what is in this PR description in the documentation somewhere.

Ok. I'll reopen the PR.

@kayceesrk kayceesrk reopened this Jun 22, 2024
@gadmm
Contributor

gadmm commented Jun 24, 2024

Thanks. @stedolan, how do you see progress being made for this PR? Presumably one just needs to copy-paste the PR description into the memory.c comment, or into an md file referenced there.

@kayceesrk
Contributor

kayceesrk commented Jun 25, 2024

how do you see progress being made for this PR?

A memory-model.md file with the details here, which also points to the original paper, the memory model sections of the manual, note in the memory.c file, etc.
