
Conversation

@BrzVlad (Member) commented Aug 19, 2016

No description provided.

Stores with release semantics guarantee that previous memory accesses are not reordered with the store. Add a full membar after the store to ensure full sequencing.
Loads with acquire semantics guarantee that following memory accesses are not reordered with the load. Add a full membar before the load to ensure full sequencing.
Previously we were wrongly trying to achieve this by doing a load-acq followed by a store-rel. Assuming we have memory accesses on addresses A (before) and C (after) while the atomic operation happens on address B, the order of the instructions would look like this:

	[A], [B]load, load-acq, store-rel, [B]store, [C]

In this case a reordering of memory accesses like the following is possible, which clearly breaks the desired constraints:

	[B]load, load-acq, [C], [A], store-rel, [B]store

In order to provide the sequential memory constraint, we could instead emit the barriers like this:

	[A], membar, [B]load, [B]store, membar, [C]

We can additionally save an instruction and emit the barriers like this instead (this is also what gcc does):

	[A], [B]load, store-rel, [B]store, membar, [C]

In this case we only need to worry about the relation between the memory access on A and the loading of B. We need to consider what happens if B is loaded before the access on A completes. This only matters if we load an older value for B (rather than the one visible to other threads) before A. In that case the erroneous store will fail, because we use exclusive stores, and on retry the value read from B will be the right one.
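
For reference, a minimal C snippet (not part of this change; the names are just for illustration) that can be compiled with gcc/clang on arm64 to compare the compiler's own seq_cst CAS sequence against the scheme above:

#include <stdbool.h>

long shared; /* plays the role of address B above */

/* Sequentially consistent CAS. Depending on the compiler version this
   becomes either an ldxr/stlxr retry loop followed by a dmb (the
   placement described above) or an ldaxr/stlxr loop. */
bool cas_seq_cst (long expected, long desired)
{
    return __atomic_compare_exchange_n (&shared, &expected, desired,
                                        false /* strong */,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}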
@vargaz (Contributor) commented Aug 19, 2016

Looks ok.

@alexrp (Contributor) commented Aug 19, 2016

LGTM

@BrzVlad merged commit c6c9631 into mono:master Aug 19, 2016
@kumpera (Contributor) commented Aug 19, 2016

I have some issues, will use this PR anyways

	}
	case OP_ATOMIC_LOAD_I1: {
		code = emit_addx_imm (code, ARMREG_LR, ins->inst_basereg, ins->inst_offset);
		if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ)

This looks wrong.

We emit a full fence and then use ldarb?

It should be one or the other: dmb + ldrb, or ldarb.

@kumpera (Contributor) commented Aug 20, 2016

This PR is double fencing pretty much everywhere that we emit a full fence. It's wasteful.

@BrzVlad (Member, Author) commented Aug 20, 2016

@kumpera These ops have acquire/release semantics by default, which is provided by the native instruction. In some cases these ops have to achieve full sequencing. For example, in the load case (A, load-acq B, C), if we only have the ldar, there is nothing preventing A from happening after B, and if we only have a membar before the access to B, there is nothing preventing C from happening before B; in either case we don't respect full memory sequencing.
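
To make the placement concrete, here is a rough C-level sketch (illustrative only, not the JIT emission itself) of the scheme described above: the full barrier goes before the acquire load and after the release store.

/* Illustrative sketch only. For a seq_cst load the full fence sits
   before the acquire load (membar, [B]load-acq); for a seq_cst store
   it sits after the release store ([B]store-rel, membar). */
static long load_seq_cst (long *p)
{
    __atomic_thread_fence (__ATOMIC_SEQ_CST);      /* keeps A before the load */
    return __atomic_load_n (p, __ATOMIC_ACQUIRE);  /* keeps the load before C */
}

static void store_seq_cst (long *p, long v)
{
    __atomic_store_n (p, v, __ATOMIC_RELEASE);     /* keeps A before the store */
    __atomic_thread_fence (__ATOMIC_SEQ_CST);      /* keeps the store before C */
}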

@alexrp (Contributor) commented Aug 20, 2016

Hmm. Given this C program:

int x;

int main()
{
    __atomic_store_n (&x, 42, __ATOMIC_SEQ_CST);
}

GCC emits:

    adrp    x0, :got:x
    ldr x0, [x0, #:got_lo12:x]
    mov w1, 42
    stlr    w1, [x0]
    ret

Clang generates similar code (no explicit dmb or other fencing instructions). Are GCC and Clang wrong, or are we forgetting some aspect of the ARM64 memory model here?

Sorry for the bad LGTM. Rodrigo is right that something's up here, either way.

@kumpera (Contributor) commented Aug 21, 2016

Vlad, I mean that we should have done the following:

if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ) {
   arm_dmb (code, 0);
   arm_ldrw (code, ins->dreg, ARMREG_LR);
} else {
   arm_ldarw (code, ins->dreg, ARMREG_LR);
}

Alex, arm64 offers the same weakly ordered memory model as arm32. You could say that the code Clang generated is correct, given that the fencing of the __atomic ops is only relative to other __atomic ops and not relative to all memory accesses.

That makes me wonder what's happening with the compare_and_exchange stuff in atomics.h

@alexrp (Contributor) commented Aug 21, 2016

@kumpera Hm, I don't quite understand why you would want just dmb + ldrw in the sequential consistency case. If the ARM64 memory model is like ARM32's, then the logic for an atomic load (which can be relaxed, acquire, or seq_cst) should be:

if (kind == seq_cst)
    barrier ();
load ();
if (kind != relaxed)
    barrier ();

Basing this on the following C program:

int x;
int y;

int main()
{
    y = __atomic_load_n (&x, /* kind */);
}

Here are the instruction sequences that'll be used by GCC/Clang on ARM32 with different barrier kinds:

__ATOMIC_RELAXED: ldr
__ATOMIC_ACQUIRE: ldr + dmb
__ATOMIC_SEQ_CST: dmb + ldr + dmb

Vlad's code matches this (although we could optimize the relaxed case to use a plain load/store without acquire/release semantics, but that's not a correctness issue).

@BrzVlad (Member, Author) commented Aug 21, 2016

@kumpera @alexrp If we always do dmb + ldr or dmb + str then we are sequenced with respect to other atomic ops, but we lose the acquire/release semantics. Whether that's acceptable depends on what we actually want to guarantee; I'm not entirely familiar with what we're supposed to enforce. Also, we should update it on arm if that's the case.

Regarding the Clang code, I don't necessarily see it as correct. If we have a store-release (which blocks reordering from left to right in instruction order) followed by a load-acquire (which blocks reordering from right to left in instruction order), then there is nothing preventing the load from happening before the store, breaking the sequential order. Maybe they figured out that in this particular case it's not really relevant to program logic.
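
For what it's worth, the concern here is the classic store-buffering pattern. A self-contained C sketch (illustrative; the variable and thread names are made up) that expresses it at the language level: with only release/acquire the C memory model allows both threads to observe 0, while __ATOMIC_SEQ_CST on all four accesses forbids that outcome.

#include <pthread.h>
#include <stdio.h>

static int x, y;   /* two independent locations */
static int r0, r1; /* values observed by each thread */

static void *t0 (void *arg)
{
    __atomic_store_n (&x, 1, __ATOMIC_RELEASE);   /* typically an stlr on arm64 */
    r0 = __atomic_load_n (&y, __ATOMIC_ACQUIRE);  /* typically an ldar on arm64 */
    return NULL;
}

static void *t1 (void *arg)
{
    __atomic_store_n (&y, 1, __ATOMIC_RELEASE);
    r1 = __atomic_load_n (&x, __ATOMIC_ACQUIRE);
    return NULL;
}

int main (void)
{
    pthread_t a, b;
    pthread_create (&a, NULL, t0, NULL);
    pthread_create (&b, NULL, t1, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    /* r0 == 0 && r1 == 0 is allowed under acquire/release,
       forbidden if all four accesses are __ATOMIC_SEQ_CST. */
    printf ("r0=%d r1=%d\n", r0, r1);
    return 0;
}

Build with something like cc -O2 -pthread; whether the weak outcome is ever observable on a given arm64 part is a separate question from what the language model allows.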

@kumpera (Contributor) commented Aug 22, 2016

Hey guys,

Yeah, my comment was wrong; we need dmb + ldarb for a sequential atomic.

@alexrp (Contributor) commented Aug 22, 2016

All this being said, our atomics in atomic.h are super broken. GCC/Clang will generate different code for the __sync builtins than the code we generate in the JIT, with respect to barriers. And since we use these functions as fallbacks when we can't use JIT intrinsics... we'll do the wrong thing for managed code. And maybe for runtime code too, who knows.

@kumpera (Contributor) commented Aug 22, 2016

Oh well, so much for intrinsics simplifying our lives.

@alexrp (Contributor) commented Aug 22, 2016

Just for posterity, if anyone is confused about the somewhat non-obvious barriers for the add, exchange, and cas opcodes: http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html (see also the two followups)
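
To relate that discussion to something compilable, a tiny snippet like this (illustrative only) can be built with -S on arm64 and compared against what the JIT emits for the corresponding atomic-add opcode:

long counter;

/* Sequentially consistent fetch-and-add: compilers emit an exclusive
   load/store loop whose acquire/release flavour (and any surrounding
   dmb) reflects the barrier trade-offs from the linked thread. */
long add_seq_cst (long v)
{
    return __atomic_fetch_add (&counter, v, __ATOMIC_SEQ_CST);
}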

@lewurm (Contributor) commented Oct 12, 2017

There is another issue with this PR: arm_dmb (code, 0) (where 0 defines CRm) is not defined by the ARMv8 ARM [cf. C6.2.65], and the documentation says the following about it:

All other encodings of CRm that are not listed above are reserved, and can be encoded using the #
syntax. It is IMPLEMENTATION DEFINED whether options other than SY are implemented. All
unsupported and reserved options must execute as a full system barrier operation, but software must
not rely on this behavior

What kind of barrier is intended? I assume SY (0xf)?
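
For reference, these are the CRm option encodings for dmb as given in the ARMv8 manual (the enum and its names below are just for illustration; double-check against the ARM ARM before relying on them):

/* dmb <option> CRm encodings per the ARMv8 ARM, transcribed here
   for reference. */
enum arm_dmb_option {
    ARM_DMB_OSHLD = 0x1,
    ARM_DMB_OSHST = 0x2,
    ARM_DMB_OSH   = 0x3,
    ARM_DMB_NSHLD = 0x5,
    ARM_DMB_NSHST = 0x6,
    ARM_DMB_NSH   = 0x7,
    ARM_DMB_ISHLD = 0x9,
    ARM_DMB_ISHST = 0xa,
    ARM_DMB_ISH   = 0xb, /* what GCC uses for __sync_synchronize, see below */
    ARM_DMB_LD    = 0xd,
    ARM_DMB_ST    = 0xe,
    ARM_DMB_SY    = 0xf  /* full system barrier */
};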

@lewurm (Contributor) commented Oct 12, 2017

Playing around with GCC (5.4.0):

void f (long *a) {
        __sync_val_compare_and_swap (a, 0x1337, *a);
}

void g () {
        __sync_synchronize ();
} 

results in:

0000000000000000 <f>:
   0:   f9400002        ldr     x2, [x0]
   4:   d28266e1        mov     x1, #0x1337                     // #4919
   8:   c85f7c03        ldxr    x3, [x0]
   c:   eb01007f        cmp     x3, x1
  10:   54000061        b.ne    1c <f+0x1c>
  14:   c804fc02        stlxr   w4, x2, [x0]
  18:   35ffff84        cbnz    w4, 8 <f+0x8>
  1c:   d5033bbf        dmb     ish
  20:   d65f03c0        ret
  24:   d503201f        nop
  
0000000000000028 <g>:
  28:   d5033bbf        dmb     ish
  2c:   d65f03c0        ret

That is, dmb ish (== arm_dmb (code, 0xb)) is used. Hmm.

@lewurm mentioned this pull request Oct 13, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022