
Conversation

@BrzVlad (Member) commented Aug 19, 2016

No description provided.

Stores with release semantics guarantee that previous memory accesses are not reordered with the store. Add a full membar after the store to ensure full sequencing.
Loads with acquire semantics guarantee that following memory accesses are not reordered with the load. Add a full membar before the load to ensure full sequencing.
Previously we were wrongly trying to achieve this by doing a load-acq followed by a store-rel. Assuming we have memory accesses on addresses A (before) and C (after) while the atomic operation happens on address B, the order of the instructions would look like this:

	[A], [B]load, load-acq, store-rel, [B]store, [C]

In this case a reordering of memory accesses like the following is possible, which clearly breaks the desired constraints:

	[B]load, load-acq, [C], [A], store-rel, [B]store

In order to provide the sequential memory constraint, we could instead emit the barriers like this:

	[A], membar, [B]load, [B]store, membar, [C]

We can additionally save an instruction and emit the barriers like this instead (this is also what gcc does):

	[A], [B]load, store-rel, [B]store, membar, [C]

In this case we only need to worry about the relation between the memory access on A and the loading of B. We need to consider what happens if B is loaded before the access on A completes. This only matters if we load an older value for B (rather than the one visible to other threads) before A. In that case the erroneous store will fail, because we use exclusive stores, and on retry the value read from B will be the right one.
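
For reference, a minimal C snippet (not part of this change; the names are just for illustration) that can be compiled with gcc/clang on arm64 to compare the compiler's own seq_cst CAS sequence against the scheme above:

#include <stdbool.h>

long shared; /* plays the role of address B above */

/* Sequentially consistent CAS. Depending on the compiler version this
   becomes either an ldxr/stlxr retry loop followed by a dmb (the
   placement described above) or an ldaxr/stlxr loop. */
bool cas_seq_cst (long expected, long desired)
{
    return __atomic_compare_exchange_n (&shared, &expected, desired,
                                        false /* strong */,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}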
@vargaz (Contributor) commented Aug 19, 2016

Looks ok.

@alexrp (Contributor) commented Aug 19, 2016

LGTM

@BrzVlad merged commit c6c9631 into mono:master Aug 19, 2016
@kumpera (Contributor) commented Aug 19, 2016

I have some issues, will use this PR anyways

	}
	case OP_ATOMIC_LOAD_I1: {
		code = emit_addx_imm (code, ARMREG_LR, ins->inst_basereg, ins->inst_offset);
		if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ)

This looks wrong.

We emit a full fence and then use ldarb?

It should be one or the other: dmb + ldrb, or ldarb.

@kumpera (Contributor) commented Aug 20, 2016

This PR is double fencing pretty much everywhere that we emit a full fence. It's wasteful.

@BrzVlad (Member, Author) commented Aug 20, 2016

@kumpera These ops have acquire/release semantics by default, which is provided by the native instruction. In some cases these ops have to achieve full sequencing. For example, in the load case (A, load-acq B, C), if we only have the ldar, there is nothing preventing A from happening after B, and if we only have a membar before the access to B, there is nothing preventing C from happening before B; in either case we don't respect full memory sequencing.
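
To make the placement concrete, here is a rough C-level sketch (illustrative only, not the JIT emission itself) of the scheme described above: the full barrier goes before the acquire load and after the release store.

/* Illustrative sketch only. For a seq_cst load the full fence sits
   before the acquire load (membar, [B]load-acq); for a seq_cst store
   it sits after the release store ([B]store-rel, membar). */
static long load_seq_cst (long *p)
{
    __atomic_thread_fence (__ATOMIC_SEQ_CST);      /* keeps A before the load */
    return __atomic_load_n (p, __ATOMIC_ACQUIRE);  /* keeps the load before C */
}

static void store_seq_cst (long *p, long v)
{
    __atomic_store_n (p, v, __ATOMIC_RELEASE);     /* keeps A before the store */
    __atomic_thread_fence (__ATOMIC_SEQ_CST);      /* keeps the store before C */
}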

@alexrp (Contributor) commented Aug 20, 2016

Hmm. Given this C program:

int x;

int main()
{
    __atomic_store_n (&x, 42, __ATOMIC_SEQ_CST);
}

GCC emits:

    adrp    x0, :got:x
    ldr x0, [x0, #:got_lo12:x]
    mov w1, 42
    stlr    w1, [x0]
    ret

Clang generates similar code (no explicit dmb or other fencing instructions). Are GCC and Clang wrong, or are we forgetting some aspect of the ARM64 memory model here?

Sorry for the bad LGTM. Rodrigo is right that something's up here, either way.

@kumpera (Contributor) commented Aug 21, 2016

Vlad, I mean that we should have done the following:

if (ins->backend.memory_barrier_kind == MONO_MEMORY_BARRIER_SEQ) {
   arm_dmb (code, 0);
   arm_ldrw (code, ins->dreg, ARMREG_LR);
} else {
   arm_ldarw (code, ins->dreg, ARMREG_LR);
}

Alex, arm64 offers the same weakly ordered memory model as arm32. You could say that the code Clang generated is correct, given that the fencing of the __atomic ops is only relative to other __atomic ops and not relative to all memory accesses.

That makes me wonder what's happening with the compare_and_exchange stuff in atomics.h

@alexrp (Contributor) commented Aug 21, 2016

@kumpera Hm, I don't quite understand why you would want just dmb + ldrw in the sequential consistency case. If the ARM64 memory model is like ARM32's, then the logic for an atomic load (which can be relaxed, acquire, or seq_cst) should be:

if (kind == seq_cst)
    barrier ();
load ();
if (kind != relaxed)
    barrier ();

Basing this on the following C program:

int x;
int y;

int main()
{
    y = __atomic_load_n (&x, /* kind */);
}

Here are the instruction sequences that'll be used by GCC/Clang on ARM32 with different barrier kinds:

__ATOMIC_RELAXED: ldr
__ATOMIC_ACQUIRE: ldr + dmb
__ATOMIC_SEQ_CST: dmb + ldr + dmb

Vlad's code matches this (although we could optimize the relaxed case to use a plain load/store without acquire/release semantics, but that's not a correctness issue).

@BrzVlad (Member, Author) commented Aug 21, 2016

@kumpera @alexrp If we always do dmb + ldr or dmb + str then we are sequenced with respect to other atomic ops, but we lose the acquire/release semantics. Whether that's acceptable depends on what we actually want to guarantee; I'm not entirely familiar with what we're supposed to enforce. Also, we should update it on arm if that's the case.

Regarding the Clang code, I don't necessarily see it as correct. If we have a store-release (which blocks reordering from left to right in instruction order) followed by a load-acquire (which blocks reordering from right to left in instruction order), then there is nothing preventing the load from happening before the store, breaking the sequential order. Maybe they figured out that in this particular case it's not really relevant to program logic.
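
For what it's worth, the concern here is the classic store-buffering pattern. A self-contained C sketch (illustrative; the variable and thread names are made up) that expresses it at the language level: with only release/acquire the C memory model allows both threads to observe 0, while __ATOMIC_SEQ_CST on all four accesses forbids that outcome.

#include <pthread.h>
#include <stdio.h>

static int x, y;   /* two independent locations */
static int r0, r1; /* values observed by each thread */

static void *t0 (void *arg)
{
    __atomic_store_n (&x, 1, __ATOMIC_RELEASE);   /* typically an stlr on arm64 */
    r0 = __atomic_load_n (&y, __ATOMIC_ACQUIRE);  /* typically an ldar on arm64 */
    return NULL;
}

static void *t1 (void *arg)
{
    __atomic_store_n (&y, 1, __ATOMIC_RELEASE);
    r1 = __atomic_load_n (&x, __ATOMIC_ACQUIRE);
    return NULL;
}

int main (void)
{
    pthread_t a, b;
    pthread_create (&a, NULL, t0, NULL);
    pthread_create (&b, NULL, t1, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    /* r0 == 0 && r1 == 0 is allowed under acquire/release,
       forbidden if all four accesses are __ATOMIC_SEQ_CST. */
    printf ("r0=%d r1=%d\n", r0, r1);
    return 0;
}

Build with something like cc -O2 -pthread; whether the weak outcome is ever observable on a given arm64 part is a separate question from what the language model allows.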

@kumpera (Contributor) commented Aug 22, 2016

Hey guys,

Yeah, my comment was wrong; we need dmb + ldarb for a sequential atomic.

@alexrp (Contributor) commented Aug 22, 2016

All this being said, our atomics in atomic.h are super broken. GCC/Clang will generate different code for the __sync builtins than the code we generate in the JIT, with respect to barriers. And since we use these functions as fallbacks when we can't use JIT intrinsics... we'll do the wrong thing for managed code. And maybe for runtime code too, who knows.

@kumpera (Contributor) commented Aug 22, 2016

Oh well, so much for intrinsics simplifying our lives.

@alexrp (Contributor) commented Aug 22, 2016

Just for posterity, if anyone is confused about the somewhat non-obvious barriers for the add, exchange, and cas opcodes: http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html (see also the two followups)
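
To relate that discussion to something compilable, a tiny snippet like this (illustrative only) can be built with -S on arm64 and compared against what the JIT emits for the corresponding atomic-add opcode:

long counter;

/* Sequentially consistent fetch-and-add: compilers emit an exclusive
   load/store loop whose acquire/release flavour (and any surrounding
   dmb) reflects the barrier trade-offs from the linked thread. */
long add_seq_cst (long v)
{
    return __atomic_fetch_add (&counter, v, __ATOMIC_SEQ_CST);
}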

@lewurm (Contributor) commented Oct 12, 2017

There is another issue with this PR: arm_dmb (code, 0) (where 0 defines CRm) is not defined by the ARMv8 ARM [cf. C6.2.65], and the documentation says the following about it:

All other encodings of CRm that are not listed above are reserved, and can be encoded using the #
syntax. It is IMPLEMENTATION DEFINED whether options other than SY are implemented. All
unsupported and reserved options must execute as a full system barrier operation, but software must
not rely on this behavior

What kind of barrier is intended? I assume SY (0xf)?
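
For reference, these are the CRm option encodings for dmb as given in the ARMv8 manual (the enum and its names below are just for illustration; double-check against the ARM ARM before relying on them):

/* dmb <option> CRm encodings per the ARMv8 ARM, transcribed here
   for reference. */
enum arm_dmb_option {
    ARM_DMB_OSHLD = 0x1,
    ARM_DMB_OSHST = 0x2,
    ARM_DMB_OSH   = 0x3,
    ARM_DMB_NSHLD = 0x5,
    ARM_DMB_NSHST = 0x6,
    ARM_DMB_NSH   = 0x7,
    ARM_DMB_ISHLD = 0x9,
    ARM_DMB_ISHST = 0xa,
    ARM_DMB_ISH   = 0xb, /* what GCC uses for __sync_synchronize, see below */
    ARM_DMB_LD    = 0xd,
    ARM_DMB_ST    = 0xe,
    ARM_DMB_SY    = 0xf  /* full system barrier */
};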

@lewurm (Contributor) commented Oct 12, 2017

Playing around with GCC (5.4.0):

void f (long *a) {
        __sync_val_compare_and_swap (a, 0x1337, *a);
}

void g () {
        __sync_synchronize ();
} 

results in:

0000000000000000 <f>:
   0:   f9400002        ldr     x2, [x0]
   4:   d28266e1        mov     x1, #0x1337                     // #4919
   8:   c85f7c03        ldxr    x3, [x0]
   c:   eb01007f        cmp     x3, x1
  10:   54000061        b.ne    1c <f+0x1c>
  14:   c804fc02        stlxr   w4, x2, [x0]
  18:   35ffff84        cbnz    w4, 8 <f+0x8>
  1c:   d5033bbf        dmb     ish
  20:   d65f03c0        ret
  24:   d503201f        nop
  
0000000000000028 <g>:
  28:   d5033bbf        dmb     ish
  2c:   d65f03c0        ret

That is, dmb ish (== arm_dmb (code, 0xb)) is used. Hmm.

@lewurm mentioned this pull request Oct 13, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022