[arm64] Fix atomics #3418
Conversation
Stores with release semantics guarantee that previous memory accesses are not reordered with the store. Add a full membar after it to ensure full sequencing.
Loads with acquire semantics guarantee that following memory accesses are not reordered with the load. Add a full membar before it to ensure full sequencing.
Previously we were wrongly trying to achieve this by doing a load-acq followed by a store-rel. Assuming we have memory access on address A (before) and C (after) while the atomic operation happens on address B, the order of the instructions would look like this:

    [A], [B]load, load-acq, store-rel, [B]store, [C]

In this case a reordering of memory accesses like the following is possible, which clearly breaks the desired constraints:

    [B]load, load-acq, [C], [A], store-rel, [B]store

In order to provide the sequential memory constraint we could emit the barriers like this instead:

    [A], membar, [B]load, [B]store, membar, [C]

We can additionally save an instruction and emit the barriers like this instead (this is also what gcc does):

    [A], [B]load, store-rel, [B]store, membar, [C]

In this case we only need to worry about the relation between memory accesses on A and the loading of B. We need to consider what happens if B is loaded before the access on A is fulfilled. This only matters if we load an older value for B (rather than the one visible to other threads) before A. In that case the erroneous store will fail due to using exclusive stores. At the retry the value read from B will be the right one.
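To make the access pattern concrete, here is a minimal C sketch (an illustration only, using GCC-style __atomic builtins, not code from this change); the variable names A, B and C mirror the addresses in the description, and the comments describe the ordering the emission scheme above is meant to guarantee:

    long A, B, C;

    /* Sketch of the pattern from the description: a plain access [A],
     * a sequentially consistent atomic exchange on B, then a plain
     * access [C]. With the scheme above the JIT emits the exclusive
     * load / store-release loop for B followed by a full membar, so [C]
     * cannot move before the atomic op, and a stale read of B that
     * races ahead of [A] is retried because the exclusive store fails. */
    long exchange_seq_cst (long v)
    {
        A = 1;                                            /* [A] */
        long old = __atomic_exchange_n (&B, v, __ATOMIC_SEQ_CST);
        C = 1;                                            /* [C] */
        return old;
    }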
|
Looks ok. |
|
LGTM |
|
I have some issues, will use this PR anyways |
    }
    case OP_ATOMIC_LOAD_I1: {
        code = emit_addx_imm (code, ARMREG_LR, ins->inst_basereg, ins->inst_offset);
        if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ)
This looks wrong.
We emit a full fence and then use ldarb?
It should be one or the other: dmb + ldrb, or ldarb.
|
This PR is double fencing pretty much everywhere that we emit a full fence. It's wasteful. |
|
@kumpera These ops have acquire/release semantics by default, which is provided by the native instruction. In some cases these ops have to achieve full sequencing. For example, in the load case (A, load-acq B, C), if we only have the ldar, there is nothing preventing A from happening after B, and if we only have a membar before the access to B, there is nothing preventing C from happening before B, so in either case we don't respect the full memory sequencing. |
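The load case described above, spelled out as a small C sketch (an illustration only, assuming GCC-style __atomic builtins; the comments describe the instructions this PR emits, per the description):

    long A, B, C;

    /* Seq-cst atomic load of B between two plain accesses.
     * Per this PR, the JIT emits a full membar (dmb) before the
     * load-acquire (ldar): the dmb keeps the access to A from moving
     * after B, and the ldar keeps the access to C from moving before B. */
    long load_seq_cst (void)
    {
        A = 1;                                            /* [A] */
        long b = __atomic_load_n (&B, __ATOMIC_SEQ_CST);  /* [B] */
        C = 1;                                            /* [C] */
        return b;
    }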
|
Hmm. Given this C program:

    int x;
    int main()
    {
        __atomic_store_n (&x, 42, __ATOMIC_SEQ_CST);
    }

GCC emits:

    adrp    x0, :got:x
    ldr     x0, [x0, #:got_lo12:x]
    mov     w1, 42
    stlr    w1, [x0]
    ret

Clang generates similar code (no explicit dmb). Sorry for the bad LGTM. Rodrigo is right that something's up here, either way. |
|
Vlad, I mean that we should have done the following:

    if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ) {
        arm_dmb (code, 0);
        arm_ldrw (code, ins->dreg, ARMREG_LR);
    } else {
        arm_ldarw (code, ins->dreg, ARMREG_LR);
    }

Alex, arm64 offers the same weakly ordered model as arm32. You could say that the code clang generated is correct, given that the fencing of the __atomic ops is only relative to other __atomic ops and not relative to all memory accesses. That makes me wonder what's happening with the compare_and_exchange stuff in atomics.h |
|
@kumpera hm, I don't quite understand why you want just dmb + a plain ldr there. Basing this on the following C program:

    int x;
    int y;
    int main()
    {
        y = __atomic_load_n (&x, /* kind */);
    }

Here are the instruction sequences that'll be used by GCC/Clang on ARM32 with different barrier kinds:

Vlad's code matches this (although we could optimize the |
|
@kumpera @alexrp if we always do dmb + ldr or dmb + str then we are sequenced with other atomic ops, but we lose the acquire/release semantics; whether that's acceptable depends on what we're supposed to guarantee, and I'm not entirely familiar with what we're supposed to enforce. Also, we should update it on arm if that's the case. Regarding the clang code, I don't necessarily see it as correct. If we have a store-release (which blocks from left to right in instruction order) followed by a load-acquire (which blocks from right to left in instruction order), then there is nothing preventing the load from happening before the store, breaking the sequence order. Maybe the guys figured out that in this particular case it's not really relevant to program logic. |
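To make the store-release/load-acquire reordering concern concrete, here is a classic store-buffering litmus test (a sketch only, assuming GCC __atomic builtins and pthreads; names like t0/t1/r0/r1 are mine). With only release/acquire ordering, both threads are allowed to observe 0, which sequential consistency forbids:

    #include <pthread.h>
    #include <stdio.h>

    int x, y;        /* both start at 0 */
    int r0, r1;

    static void *t0 (void *arg)
    {
        (void) arg;
        __atomic_store_n (&x, 1, __ATOMIC_RELEASE);   /* store-release          */
        r0 = __atomic_load_n (&y, __ATOMIC_ACQUIRE);  /* following load-acquire */
        return NULL;
    }

    static void *t1 (void *arg)
    {
        (void) arg;
        __atomic_store_n (&y, 1, __ATOMIC_RELEASE);
        r1 = __atomic_load_n (&x, __ATOMIC_ACQUIRE);
        return NULL;
    }

    int main (void)
    {
        pthread_t a, b;
        pthread_create (&a, NULL, t0, NULL);
        pthread_create (&b, NULL, t1, NULL);
        pthread_join (a, NULL);
        pthread_join (b, NULL);
        /* With acquire/release only, r0 == 0 && r1 == 0 is permitted: each
         * load-acquire may be satisfied before the other thread's
         * store-release becomes visible. With __ATOMIC_SEQ_CST on all four
         * accesses, that outcome is forbidden. */
        printf ("r0=%d r1=%d\n", r0, r1);
        return 0;
    }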
|
Hey guys, yeah, my comment was wrong; we need dmb + ldarb for a sequential atomic. |
|
All this being said, our atomics in |
|
Oh well, so much for intrinsics simplifying our lives. |
|
Just for posterity, if anyone is confused about the somewhat non-obvious barriers for the |
|
There is another issue with this PR:
what kind of barrier is intended? I assume |
|
playing around with GCC (5.4.0):

    void f (long *a) {
        __sync_val_compare_and_swap (a, 0x1337, *a);
    }
    void g () {
        __sync_synchronize ();
    }

results in:

    0000000000000000 <f>:
       0:   f9400002    ldr     x2, [x0]
       4:   d28266e1    mov     x1, #0x1337    // #4919
       8:   c85f7c03    ldxr    x3, [x0]
       c:   eb01007f    cmp     x3, x1
      10:   54000061    b.ne    1c <f+0x1c>
      14:   c804fc02    stlxr   w4, x2, [x0]
      18:   35ffff84    cbnz    w4, 8 <f+0x8>
      1c:   d5033bbf    dmb     ish
      20:   d65f03c0    ret
      24:   d503201f    nop

    0000000000000028 <g>:
      28:   d5033bbf    dmb     ish
      2c:   d65f03c0    ret

that is, |
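For comparison, a sketch (not part of the thread) of the same CAS written with the newer __atomic builtin, which is the interface the earlier discussion is about; whether the compiler still appends a trailing dmb for this form is exactly the kind of thing worth checking in the same way. The helper name cas_seq_cst is hypothetical:

    /* Hypothetical helper: a seq-cst strong compare-and-swap on *a that
     * returns the value observed, mirroring __sync_val_compare_and_swap
     * in the disassembly above. */
    long cas_seq_cst (long *a, long expected, long desired)
    {
        __atomic_compare_exchange_n (a, &expected, desired,
                                     0 /* strong */,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return expected;   /* updated to the observed value on failure */
    }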
[arm64] Fix atomics (commit migrated from mono/mono@c6c9631)
No description provided.