Explain mapping between OCaml memory model and C#10995
Explain mapping between OCaml memory model and C#10995stedolan wants to merge 1 commit intoocaml:trunkfrom
Conversation
|
Thanks for the write-up. I haven't had the brain surgery mentioned by Gabriel, and what you wrote here is very clear and casts your memory model in a very natural light. If the length and the formatting inhibits you from moving it directly to the C comment, you could also add it as a separate .md file (or RFC). I am still looking for more clarification about this publication safety business since it looks important to me.
|
Thanks, this is an excellent example. The output is not necessarily 1 here, and the OCaml memory model allows outcome 0. I think you're working under the C-style let t2 r =
let n = rcu_dereference r in
if r == x then Some n else NoneAn optimising compiler could easily choose to push the load under the let t2 r =
if r == x then Some !x else Nonewhich does not preserve the dependency and makes the outcome 0 visible. This sort of issue is what makes the C The idea of "publication safety" for OCaml values is much weaker than release/consume ordering. The idea is to distinguish intitialising writes from other writes, and to guarantee that any access to an object must at least see its initialising writes. This is necessary for type safety: we want to ensure that even if a reference to an object is obtained via a data race, its contents are still valid. For instance, consider this racy access: Here, the access to Conversely, this guarantee is not made if the writes are not initialising: Here, the two writes This weaker guarantee does not require tricky reasoning about dependency in the compiler. Values in OCaml are initialised as soon as they are allocated, and allocations yield fresh values initially private to a single domain. Objects are always fully initialised by a single domain before they ever become visible to a second domain, and the relevant address dependencies will always be present in the second domain because there is no other way to get the value. This is the feature that was missing in your example: there, the value You can imagine this becoming false with a compiler that speculated on the value of pointers it had never seen before, by e.g. comparing them to some constant it magics up. Happily, this sort of thing would be invalid in any case due to GC (which can change the concrete value of pointers), forcing the current and any future optimising compiler for OCaml to treat GC pointers as opaque tokens. I suppose a C compiler could in theory do this, which as mentioned in #10992 is probably a distant future worry. (This stuff didn't really show up in the PLDI paper: there was some stuff about "initial writes", but since we didn't model allocation they're not very interesting) Regarding performance, I agree it would be good to re-run some benchmarks now that the GC design has changed. I don't have a particularly strong feeling for how much the current
|
|
Thanks for the additional explanations. I was too focused on compiler transformations on the writing side but now I see that there is a problem with the reading side. (Your example with You confirm one aspect which is that initialization is meant at program-level rather than low-level allocator behaviour (e.g. like the C allocation functions which initialize everything with 1s). As you say, in the example with Some side-condition is missing to my synchronisation mechanism. For instance, if you only use it with immutable values then there cannot be any problem. Now the PLDI'18 model will already prevent you from reasoning about such programs (IIUC the limitation of the model, it would tell you that you can get the imaginary value v₀). I was wondering if a weaker side-condition could be something like “only call I also agree that non-atomic loads are not “ The last two points remind a lot of the situation with the Linux RCU vs. the C11 memory model. However compared to the former's non-standard use of For the context, my questions are motivated by something that was puzzling me: is the effort on the memory model only for the sake of giving a nice meaning to “buggy” programs, or does it allow new useful behaviours? According to http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1444.htm which introduced |
|
I'm confused. When I compile the current atomic store code ( atomic_thread_fence(memory_order_acquire);
ret = atomic_exchange(Op_atomic_val(ref), v);is becoming: I'm trying to reconcile that with this statement:
It looks to me that this C compiler is giving us an equivalent of:
which puts us in the unoptimised ARMv8 model and so no bug in What am I missing? |
|
Atomic and non-atomic stores go in pair. You cannot have the optimised version for one without the optimised version for the other and conversely. |
|
The intention of the PR was to rewrite the note at As an aside, I keep coming back to this PR and other related ones often to remind myself of the memory model intricacies. Hence, I've added the tag |
|
I don't think this is right. #12473 added the note [MMOC] whereas the present PR is about improving the note [MM] right above it. It would still be very nice to add what is in this PR description in the documentation somewhere. I also made a point about relaxed atomic reads in relationship with clarifying publication-safety, which has not been addressed yet. I think that part of the discussion has continued here: ocaml/RFCs#42. I am curious to know if efforts have been made to rerun benchmarks from the PLDI'18 paper with the added fences that are necessary for the new GC design. |
Ok. I'll reopen the PR. |
|
Thanks. @stedolan, how do you see progress being made for this PR? Presumably one just needs to copy-paste the PR description into the |
A |
Evidently, the huge comment in
memory.cisn't explaining much, and theacquire; releasecode sequence incaml_modifycontinues to mystify. Here's an attempt at a better explanation of the relationship between the OCaml memory model and C: if this one makes sense, I'll update the comment inmemory.c. I hope this answers @gadmm's questions and sheds some light on the armv8 mappings.OCaml and C memory models
LIke C, the OCaml memory model distinguishes atomic and nonatomic memory accesses, so the natural approach is to map OCaml atomics to C SC atomics (
memory_order_seq_cst, the default) and OCaml nonatomics to C relaxed atomics (memory_order_relaxed). (We cannot use C nonatomic accesses instead of relaxed here, because we want well-defined semantics even in the presence of races, which C nonatomics do not provide).However, this simple mapping does not work, because OCaml's memory model is stronger than what this mapping to C provides in two particular ways, described below:
1. Stronger atomic happens-before
Both C and OCaml define happens-before, a partial order on events of a program which defines which orderings can be relied upon for synchronisation. For atomics, the relation is stronger in OCaml: with C (SC) atomics, happens-before only arises between a write and a read that reads from it. In particular, if two domains/threads both write to an atomic variable, in OCaml one of the writes definitely happens-before the other, which is not true in C, leading to some counterintuitive examples.
Example
Below,
pis anint Atomic.tandqis anint ref, both initially 0:Is it possible for this to print
p=2andq=0? In OCaml, no. If it printsp=2, thenAtomic.set p 2must have occurred right betweenAtomic.set p 1andAtomic.get p, which means that theq := 1happens-before theset p 1which happens-before theset p 2, which happens-before the!q, so we must seeq=1.In C, with SC atomics for
Atomicand relaxed atomics forref, it can printp=2andq=0. Theq := 1happens-beforeset p 1, which is SC-beforeset p 2, which happens-before!q, but in C11 SC-before between two SC writes does not induce happens-before, soq := 1does not happen-before!qand!qmay therefore return0.The fix is to map an OCaml atomic store not to a C atomic store but to a C atomic exchange. Each atomic exchange reads from the previous one (even if it discards the answer), and this additional read is enough to induce happens-before in C, as the OCaml memory model requires.
2. Nonatomic reads are consistent with atomic operations
In C, the SC order (the total order on all SC atomic operations) is required to be consistent with happens-before, and the relaxed atomic reads-from relation (which specifies which write each read gets its value from) is also required to be consistent with happens-before. However, the SC order and the relaxed atomic reads-from relation are not required to be consistent with each other, which leads to weird effects where relaxed atomics reads can get their values from writes that are, by the SC order, in the future.
Example
Again,
pis anint Atomic.twhileqis anint ref.Is it possible to see this print "x=1 y=0"? In OCaml, the answer is no. Consider which occurs earlier in the trace, the
Atomic.set p 1or theAtomic.get p. If thesetis earlier, we will seey=1, while if thegetis earlier then the!qcannot see theq := 1, so we will seex=0.However, in C11 (with SC for
Atomic.tand relaxed forref), the outcomex=1 y=0is possible. In C11 terms, theAtomic.sethappens-before theq := 1, which is read-from by the!q, which happens-before theAtomic.get, which is SC-before theAtomic.set. This cycle of reads-from, SC-before and happens-before is permitted by the model, and actually occurs on e.g. ARM.In OCaml, the nonatomic reads-from, happens-before and the order on atomics are all required to be consistent with each other. (It is this property, more than anything else, that makes the step-by-step operational semantics possible).
Achieving this is a bit tricky in C. The easiest way is to use release-acquire rather than relaxed atomics for OCaml nonatomics, as reads-from between release-acquire atomics forms part of happens-before in C, and must therefore be consistent with the SC order.
Mapping of operations to C:
atomic_load_explicit(..., memory_order_seq_cst)atomic_exchange_explicit(..., memory_order_seq_cst)atomic_load_explicit(..., memory_order_acquire)atomic_store_explicit(..., memory_order_release)Optimising the mapping, C version
The mapping above is correct but extremely costly on non-x86 platforms. (This model was benchmarked in the PLDI paper under the name "SRA", with 3x slowdowns on some benchmarks).
Much of the cost is because nonatomic loads require a fence due to the use of
memory_order_acquire. However, this is much stronger than we need: OCaml places no ordering constraints between subsequent nonatomic loads, butmemory_order_acquireon each load forces them all to go in order.C allows explicit fences to replace memory ordering on operations, allowing one to turn:
into:
Since we are only using
acquiresemantics to maintain consistency of nonatomic reads-from and atomic ordering, we only need this fence sometime before the next event that can be read-from or is atomic, that is, before the next nonatomic write or atomic operation. In fact, we can always delay the fence until the next such operation, giving us the optimised mapping below:Optimised mapping of operations to C:
atomic_thread_fence(memory_order_acquire);atomic_load_explicit(..., memory_order_seq_cst);atomic_thread_fence(memory_order_acquire);atomic_exchange_explicit(..., memory_order_seq_cst);atomic_load_explicit(..., memory_order_relaxed);atomic_thread_fence(memory_order_acquire);atomic_store_explicit(..., memory_order_release);Optimising the mapping, ARMv8 version
I think the above is about as far as we can go while remaining strictly within the letter of the C memory model. However, when compiling to ARMv8, better instruction sequences are possible than C compilers will produce with the above mapping.
The instruction sequences produced by the above on C compilers I've tried are:
dmb ishld; ldardmb ishld; L: ldaxr; stlxr; cbnz Lldrdmb ishld; stlrFirst, the
dmb ishldin atomic stores does nothing:dmb ishldenforces ordering with respect to prior loads, which is subsumed bystlxrwhich enforced ordering with respect to prior everything. (The C compiler should really be able to elide this one, I think).Second, the
dmb ishld; stlrsequence is overly conservative. We need to ensure that nonatomic stores are ordered after prior atomic operations and nonatomic loads. (More precisely, we need happens-before between operations that synchronise with a prior atomic operator / write to a prior atomic load, and operations that read from this nonatomic store). Strictly, in C, the only way to do this is for this to be a release store (hencestlr), but the ARMv8 model is more forgiving and provides many way to ensure this ordering.One of them is to use
dmb ishldanddmb ishstand a plainstrstore.dmb ishldprovides ordering with respect to prior reads (covering both nonatomic loads and atomic ones), whiledmb ishstprovides ordering with respect to prior writes. Since we only need the ordering with respect to prior atomic writes, we can place thedmb ishstafter each atomic write rather than before nonatomic writes, yielding the optimised compilation model from the PLDI paper:dmb ishld; ldarL: ldaxr; stlxr; cbnz L; dmb stldrdmb ishld; strThis is the model adopted in #10972.
There is one additional wrinkle, though, not covered by the discussion here or the PLDI paper: when there is a data race involving a newly-allocated object, then we must guarantee that all reads that see that object at least see the initialisation of that object. This can be done by upgrading the nonatomic store to an
stlror admb ishst; strwhen it might make a newly-allocated object visible, and relying on some slightly tricky reasoning about address dependencies on the reading end. This issue is being discussed separately in #10992.Proposed code changes
Since Arm64 multicore support #10972 adopts the optimised ARMv8 model above, we must ensure that there is a suitable barrier after atomic stores. (The memory model requires ordering between atomic stores and subsequent nonatomic stores, and if we're using the optimised version of writes that only have
dmb ishldthen we need to provide this ordering on atomic stores. This is the inconsistency @gadmm noticed on #10831 - the code that's currently there is correct in isolation but requires an extra fence if used with optimised ARM stores).The barriers on
caml_modifyare only necessary to order prior initialisations, nonatomic loads, and atomic operations. In particular, in a sequence ofcaml_modify, only the first one needs barriers. Perhaps we should provide optimised versions of blit and fill, for use with arrays.caml_modifyis implemented with an acquire fence / release store. As described above, on armv8 the release store subsumes the acquire fence. While the fence is necessary according to a strict reading of the C model, I'm not aware of any architecture where it actually does anything in this situation. We should at least run benchmarks with it commented out, to see if it has a noticeable cost.I'll update the massive block comment in
memory.cwith something hopefully more readable.