Conversation
|
My daily driver is arm64 so this PR makes me very happy ^_^ |
|
@EduardoRFS , you need to look into this... |
|
@ctk21 I didn't knew you were working on this thank you so much, I was late to the party, this PR is quite close to what I had, so we can just follow here. Will be reviewing it in the near future. About the
Yeah I was having a hard time on this, is it documented somewhere how all the magic works? |
|
We need the PR #10943 for correct compilation of memory model. Then,
See table 5 https://kcsrk.info/papers/pldi18-memory.pdf. Atomic stores are compiled as external calls, and hence, nothing needs to be done there. |
There's no documentation on this currently except for the comments in |
Except initializing stores Most stores are initializing stores, so it makes a big difference in code size. |
Indeed. The barrier is not needed for initialising stores. |
|
For non-initializing non-atomic store did you not decide for |
Yes, I believe so. We need a release store for the publication safety of initialising stores.
I don't know enough about the translation to C11 memory model. Perhaps @stedolan can say more. |
|
#10943 is merged now. Thanks for merging, @gasche. While we are reviewing Arm64, I'd like to see the codegen side of the memory model discussed and implemented in this PR as well. The codegen changes are small and IINM contained to the emission of appropriate instructions for non-atomic assignment (but not initialising) It would also be great if we can have eyes on |
I don't think let r1 = ref [||]
let t1 () = r1 := [| "Hello, world" |] (* thread 1 *)
let t2 () = print_endline (!r1.(0)) (* thread 2 *) It would be bad (crash!) if Lines 2212 to 2225 in ebb9f0d
|
|
If you know that all potentially-publishing stores of blocks necessarily go though |
|
Sorry for forcing people to work more on this, but I'd like to ask for an explanation of "publication safety" that non-experts like me can follow. (I think this is interesting in particular because arm64 is the first non-TSO backend to be merged, and the discussions here may inform other backends in the future.) caml_modify first performs an "acquire fence" and then a "release store/write" in the modified location. There is a comment in memory.c that explains why this guarantees memory-safety, but I cannot check that I agree with what it says, and or with what's going on here. In my naive understanding of memory models, release stores synchronize with acquire loads (when the load sees the result of the store), and this guarantees that any writes (atomic or not) seen by the release store will also be seen after the acquire load. (The C memory model are also "consume" loads with weaker guarantees.) But in the example of @kayceesrk above, the read of the reference (Is this related to the fence? I'm not fully clear on what these fences are doing, but my current understanding is that they ensure that if a thread sees the write to |
|
Our messages crossed, I have guessed that this relies on Arm's dependency ordering (cf. consume ordering which you mentioned, which was never implemented as intended in C compilers). But this opens more questions, for which I am preparing an issue. |
The key point here is that there is an address dependency between the two loads on Here is a broken message passing example constructed only using non-atomic locations for comparison: let msg = ref 0 and flag = ref None
let t1 () =
msg := 1;
flag := Some 42
let t2 () =
let rf = !flag in
let rm = !msg in
assert (not (rf = Some 42 && rm = 0)) (* may fail *)The write to [1] Limits on ordering, https://developer.arm.com/documentation/102376/0100/Normal-memory |
|
I must admit that I'm not familiar with the axiomatic memory model from the paper (the operational model is simple and nice, but it doesn't say anything here as we are mixing atomic and non-atomic accesses on the same location); I haven't had the brain surgery yet that makes people comfortable with axiomatic memory models. If I understand correctly, your reply basically means that on Arm, non-atomic loads behave like "consume" loads when they synchronize with "release" stores? (I mentioned that compiling all (non-atomic) reads of mutable locations to acquire loads would be safe; in fact I guess that having non-atomic reads of mutable locations act as "consume" loads suffices for this problem of block value initialization, as all potentially-problematic loads are dependent on this load.) |
I believe so. In fact, I now notice that the behaviour in this example is exactly the "publisher-subscriber situations with pointer-mediated publication" in the description in [1]. But the document also says compilers promote consume to acquire, and discourages the use of consume ordering. Hence, I think it may be better to reason about the mixed-mode behaviours in terms of the hardware model and not C++11. [1] https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering |
|
This conversation has been going above my head for a while already, but:
The discussion is all about OCaml nonatomic loads and stores, as far as I can see. (Actually, static typing prevents mixing atomic and nonatomic accesses on the same location, again as far as I can see.) So, the models from the paper should tell you which behaviors are allowed. Then, all you need is @maranget - level understanding of the ARMv8 memory model to check that the proposed implementation doesn't allow more behaviors than that. |
I understand the question as mixing OCaml and C11 models. As far as I understand , in the specific scenario, one wishes ordering from one read to another, when the second read address depends on the first read value, a.k.a. address dependency. The difficulty originates from coding the read-to-read sequence in C. Armv8 leaves address dependencies alone, i.e no fence needed here, nor load-acquire for performing the first read. But in C?
As a conclusion the cast to volatile on pointers to aligned values does not incur much penalty, if any. But well this is not C11! (*) Well Consume is here to enforce address dependency as an order, for DEC alpha, which does not provide it by default. |
|
Again, this goes above my head, but: there is only one piece of C code involved in the discussion as far as I can see, namely |
|
Agree with Xavier that the only C code here is in |
|
The confusion is a bit of my fault, I had a draft pointing out two issues with this dependency ordering, including the one with C code. I was proposing that the |
|
Thank you for bits of the missing context. I'd like to recenter the memory model discussion on this PR. If I understand correctly, here is the proposed implementation of the Multicore OCaml memory model on AArch64:
Note: the last line is the code produced by GCC 11.2 for the "acquire fence; release store" sequence in Is everyone comfortable with this implementation? Especially the use of a plain Please try to give clear answers without hidden context. [ Edited to use |
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations
Multicore support for 32-bit ARM (ARMv6/ARMv7) based on PR ocaml#10972. Key changes: - Stack switching macros (SWITCH_OCAML_TO_C, SWITCH_C_TO_OCAML) - GC register save/restore using gc_regs buckets (SAVE_ALL_REGS, RESTORE_ALL_REGS) - Fiber/effect handlers (caml_perform, caml_reperform, caml_resume, caml_runstack) - Proper trap handling with Thumb interworking support - Poll implementation for checking pending actions - ARMv6/ARMv7 memory barrier support for atomic operations

This PR implements the assembler, proc and emit changes needed to get ARM64 working. I have tested it using the compiler testsuite on MacOS/M1 and a Linux/Graviton2.
What this PR intends to do:
IloadandIstoreto implement the OCaml memory model in ARMv8.What this PR doesn't do:
I have not spent quality time plumbing through the memory model requirements; I've limited the exercise to getting the assembler up and the testsuite going. My thinking was to handle the memory model in a follow on.I'm keen to get feedback, the assembler is very much: "get it to run". There may be beneficial refactorings or ARM64 idioms (or <gasp> bugs not exercised by our testsuite) that other eyes will see.
[edited to reflect the PR now incorporates changes in emit for the memory model]