Skip to content

[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+#161638

Merged
Pierre-vh merged 1 commit intomainfrom
users/pierre-vh/fix-cumode-fence-gfx10
Oct 21, 2025
Merged

[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+#161638
Pierre-vh merged 1 commit intomainfrom
users/pierre-vh/fix-cumode-fence-gfx10

Conversation

@Pierre-vh
Copy link
Contributor

@Pierre-vh Pierre-vh commented Oct 2, 2025

They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following sequence of events in one thread:

  • some stores
  • store atomic release syncscope("workgroup")
  • barrier

then another thread follows with

  • barrier
  • load atomic acquire
  • store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with global_wb. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance.

So the choice here is to streamline the code sequences by making CU and WGP mode emit almost identical (vL0 inv is not needed in CU mode) code for release (or stronger) atomic ordering.

This also removes the vm_vsrc(0) wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed.

Supersedes #160501
Solves SC1-6454

Copy link
Contributor Author

Pierre-vh commented Oct 2, 2025

@github-actions
Copy link

github-actions bot commented Oct 2, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get)' 'HEAD~1' HEAD llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll

The following files introduce new uses of undef:

  • llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@Pierre-vh Pierre-vh marked this pull request as ready for review October 2, 2025 08:24
@llvmbot
Copy link
Member

llvmbot commented Oct 2, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following sequence of events in one thread:

  • some stores
  • store atomic release syncscope("workgroup")
  • barrier

then another thread follows with

  • barrier
  • load atomic acquire
  • store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with global_wb. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance.

So the choice here is to streamline the code sequences by making CU and WGP mode emit identical code for release (or stronger) atomic ordering.

This also removes the vm_vsrc(0) wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed.

Supersedes #160501
Solves SC1-6454


Patch is 428.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161638.diff

16 Files Affected:

  • (modified) llvm/docs/AMDGPUUsage.rst (+3-58)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+14-37)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll (+24-6)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+8-12)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (+48)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (+48-9)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (+8-3)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (+601-185)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (+8-3)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (+540-92)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (+240-90)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll (+240-90)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (+240-90)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (+8-3)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (+240-90)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 74b7604fda56d..cba86b3d5447e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -13229,9 +13229,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Could be split into
@@ -13277,8 +13274,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. buffer/global/flat_store
      store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
@@ -13366,9 +13361,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
@@ -13413,8 +13405,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. buffer/global/flat_atomic
      atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
@@ -13498,9 +13488,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
@@ -13627,9 +13614,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit
                                                              lgkmcnt(0).
                                                            - Must happen after
@@ -13681,8 +13665,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          2. buffer/global_atomic
                                                          3. s_waitcnt vm/vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - Use vmcnt(0) if atomic with
                                                              return and vscnt(0) if
                                                              atomic with no-return.
@@ -13707,8 +13689,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
 
      atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - If OpenCL, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
@@ -13768,9 +13748,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
@@ -13816,9 +13793,9 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
                                                          3. s_waitcnt lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
+                                                           - If atomic with return, omit
+                                                             vscnt(0), if atomic with
+                                                             no-return, omit vmcnt(0).
                                                            - If OpenCL, omit lgkmcnt(0).
                                                            - Must happen before
                                                              the following
@@ -13991,9 +13968,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
                                                             vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - If OpenCL and
                                                              address space is
                                                              not generic, omit
@@ -14223,9 +14197,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
                                               - generic     vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit vmcnt(0) and
-                                                             vscnt(0).
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0), s_waitcnt
@@ -14334,8 +14305,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
 
                                                          1. s_waitcnt vmcnt(0) & vscnt(0)
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - Could be split into
                                                              separate s_waitcnt
                                                              vmcnt(0) and s_waitcnt
@@ -15337,8 +15306,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                            - The waits can be
@@ -15384,8 +15351,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit.
                                                            - The waits can be
@@ -15479,8 +15444,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                            - If OpenCL and CU wavefront
@@ -15530,8 +15493,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit all.
                                                            - The waits can be
@@ -15623,8 +15584,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                            - If OpenCL and
@@ -15754,8 +15713,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                            - Must happen after
@@ -15812,8 +15769,6 @@ the instruction in the code sequence that references the table.
                                                             | **Atomic without return:**
                                                             | ``s_wait_storecnt 0x0``
 
-                                                           - If CU wavefront execution
-                                                             mode, omit.
                                                            - Must happen before
                                                              the following
                                                              ``global_inv``.
@@ -15838,8 +15793,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit.
                                                            - The waits can be
@@ -15901,8 +15854,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit ``s_wait_loadcnt 0x0``.
                                                            - The waits can be
@@ -16154,8 +16105,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL and
                                                              address space is
@@ -16384,8 +16333,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit
                                                              ``s_wait_dscnt 0x0``
@@ -16492,8 +16439,6 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
-                                                            | **CU wavefront execution mode:**
-                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If OpenCL, omit all.
                                                            - The waits can be
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 362ef140a28f8..29c326893539b 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -390,12 +390,6 @@ class SICacheControl {
                              bool IsCrossAddrSpaceOrdering,
                              Position Pos) const = 0;
 
-  /// Inserts any necessary instructions before the barrier start instruction
-  /// \p MI in order to support pairing of barriers and fences.
-  virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
-    return false;
-  };
-
   /// Virtual destructor to allow derivations to be deleted.
   virtual ~SICacheControl() = default;
 };
@@ -576,12 +570,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
                   bool IsCrossAddrSpaceOrdering, Position Pos,
                   AtomicOrdering Order, bool AtomicsOnly) const override;
 
-  bool insertAcquire(MachineBasicBlock::iterator &MI,
-                     SIAtomicScope Scope,
-                     SIAtomicAddrSpace AddrSpace,
-                     Position Pos) const override;
-
-  bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+  bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+                     SIAtomicAddrSpace AddrSpace, Position Pos) const override;
 };
 
 class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2046,8 +2036,11 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
       // the WGP. Therefore need to wait for operations to complete to ensure
       // they are visible to waves in the other CU as the L0 is per CU.
       // Otherwise in CU mode and all waves of a work-group are on the same CU
-      // which shares the same L0.
-      if (!ST.isCuModeEnabled()) {
+      // which shares the same L0. Note that we still need to wait when
+      // performing a release in this mode to respect the transitivity of
+      // happens-before, e.g. other waves of the workgroup must be able to
+      // release the memory from another wave at a wider scope.
+      if (!ST.isCuModeEnabled() || isReleaseOrStronger(Order)) {
         if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
           VMCnt |= true;
         if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -2202,22 +2195,6 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::i...
[truncated]

Copy link
Collaborator

@ssahasra ssahasra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I would have preferred modifying #160501 so that the whole discussion was immediately reachable from git blame without an extra hop of seeing which PR was superseded by this one.

Copy link
Contributor Author

Pierre-vh commented Oct 20, 2025

Merge activity

  • Oct 20, 8:59 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Oct 20, 10:05 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:12 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:16 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:22 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:29 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:34 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:40 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 10:56 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:09 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:17 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:20 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:40 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:49 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 11:56 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:01 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:05 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:20 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:23 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:36 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 12:41 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:04 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:14 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:32 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:47 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:50 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 1:55 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:07 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:16 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:19 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:21 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:30 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:41 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:48 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:51 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:54 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 2:59 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 3:17 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 3:24 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 3:26 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 3:29 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 3:56 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:06 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:16 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:19 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:22 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:28 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:31 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:35 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:37 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 4:57 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:00 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:04 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:12 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:21 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:32 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 5:44 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:03 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:07 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:10 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:14 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:17 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 6:36 PM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 20, 7:12 PM UTC: Graphite rebased this pull request as part of a merge.

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/remove-extra-wait-after-rmw branch 5 times, most recently from acc43ac to 0a3ec0f Compare October 20, 2025 09:30
Base automatically changed from users/pierre-vh/remove-extra-wait-after-rmw to main October 20, 2025 10:03
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/fix-cumode-fence-gfx10 branch 10 times, most recently from 72c7723 to c10b131 Compare October 20, 2025 11:17
@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/fix-cumode-fence-gfx10 branch from c12397c to 7b8dd96 Compare October 20, 2025 14:29
@ritter-x2a
Copy link
Member

Pierre-vh force-pushed the users/pierre-vh/fix-cumode-fence-gfx10 branch 15 times, most recently from 9989044 to d89770d

Is this normal graphite behaviour? It is sending out a lot of email notifications!

I think we might be seeing this: #151135 The workaround is to land each PR in a stack individually, one after the other.

I tried cancelling the merge for @Pierre-vh in the Graphite UI to not block CI resources, but Graphite reports an error when I do that (and apparently keeps merging).

@Pierre-vh Pierre-vh force-pushed the users/pierre-vh/fix-cumode-fence-gfx10 branch 26 times, most recently from 175bf5d to c3057ee Compare October 20, 2025 18:02
They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following sequence of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance.

So the choice here is to streamline the code sequences by making  CU and WGP mode emit identical code for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed.
@Pierre-vh
Copy link
Contributor Author

Pierre-vh force-pushed the users/pierre-vh/fix-cumode-fence-gfx10 branch 15 times, most recently from 9989044 to d89770d

Is this normal graphite behaviour? It is sending out a lot of email notifications!

I think we might be seeing this: #151135

The workaround is to land each PR in a stack individually, one after the other.

My apologies for the inconvenience, I thought it had merged and logged off for the day. I cancelled the merge.
Next time I will merge manually because this isn't the first time graphite does this.
Feel free to close my PRs if it happens again, I can always re-open them after.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants