Skip to content

[LoopUnroll] Combine partial sub/fsub reduction results with add/fadd#201066

Open
MarshallOfSound wants to merge 2 commits into
llvm:mainfrom
MarshallOfSound:unroll-sub-reduction-combine-fix
Open

[LoopUnroll] Combine partial sub/fsub reduction results with add/fadd#201066
MarshallOfSound wants to merge 2 commits into
llvm:mainfrom
MarshallOfSound:unroll-sub-reduction-combine-fix

Conversation

@MarshallOfSound

@MarshallOfSound MarshallOfSound commented Jun 2, 2026

Copy link
Copy Markdown

When introducing parallel reduction phis during unrolling (#149470), each partial accumulator starts at the recurrence's identity value and applies the loop's updates (including any subtractions) independently, so each partial result already carries the correct sign. The final result must therefore always sum the partial results.

The combine currently uses the opcode of the recurrence kind, which is sub for RecurKind::Sub and fsub for RecurKind::FSub (reachable via #166630 with reassoc). Chaining the partial results with sub computes pN - (... - (p1 - p0)), flipping the sign of every other partial result and miscompiling the reduction.

Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting bytes >= 0x80, as in simdutf's arm64 utf8_length_from_latin1) is hot enough to be unrolled with parallel reduction phis.

The first commit adds tests for sub/fsub reductions showing the current incorrect combines; the second applies the fix.

Fixes #201065

Substantial parts of this patch were developed with the assistance of Claude Code (see the Assisted-by trailers) per the AI tool policy — I've reviewed all of it.

@MarshallOfSound

Copy link
Copy Markdown
Author

cc @fhahn @juliannagele — would appreciate a review, this fixes a miscompile in the parallel reduction phi combining from #149470 / #166630.

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown

Hello @MarshallOfSound 👋

Thank you for submitting a Pull Request (PR) to the LLVM Project. Since this is your first PR, here are a few useful links covering our main contribution policies and review practices.

  • All contributions to LLVM must follow our LLVM AI Tool Use Policy. In particular, if you used AI while working on this PR, remember to add a note to the PR description.
  • The LLVM Code-Review Policy and Practices document contains practical information about the PR process, including how patches are reviewed and accepted, and who can review a PR.
  • Our LLVM Developer Policy describes our expectations for code quality, commit summaries and contains notes on our CI system.

Please reply to this message to confirm that you have read these policies, especially the LLVM AI Tool Use Policy, and that any AI tool usage has been noted in the PR description.


Frequently asked questions

How do I add reviewers?

This PR will be automatically labeled, and the relevant teams will be notified. For some parts of the project, reviewers may also be added automatically.

You can also add reviewers manually using the Reviewers section on this page. If you cannot use that section, it is probably because you do not have write permissions for the repository. In that case, you can request a review by tagging reviewers in a comment using @ followed by their GitHub username.

What if there are no comments?

If you have not received any comments on your PR after a week, you can request a review by pinging the PR with a comment such as “Ping”. The common courtesy ping rate is once a week. Please remember that you are asking for volunteer time from other developers.

Are any special GitHub settings required to contribute to LLVM?

We only require contributors to have a public email address associated with their GitHub commits, see this section of LLVM Developer Policy for details.


If you have questions, feel free to leave a comment on this PR, or ask on LLVM Discord or LLVM Discourse.

Thank you,
The LLVM Community

@llvmorg-github-actions

Copy link
Copy Markdown

@llvm/pr-subscribers-llvm-transforms

Author: Samuel Attard (MarshallOfSound)

Changes

When introducing parallel reduction phis during unrolling (#149470), each partial accumulator starts at the recurrence's identity value and applies the loop's updates (including any subtractions) independently, so each partial result already carries the correct sign. The final result must therefore always sum the partial results.

The combine currently uses the opcode of the recurrence kind, which is sub for RecurKind::Sub and fsub for RecurKind::FSub (reachable via #166630 with reassoc). Chaining the partial results with sub computes pN - (... - (p1 - p0)), flipping the sign of every other partial result and miscompiling the reduction.

Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting bytes >= 0x80, as in simdutf's arm64 utf8_length_from_latin1) is hot enough to be unrolled with parallel reduction phis.

The first commit adds tests for sub/fsub reductions showing the current incorrect combines; the second applies the fix.

Fixes #201065


Full diff: https://github.com/llvm/llvm-project/pull/201066.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopUnroll.cpp (+13-3)
  • (modified) llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll (+138)
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index ffb4d1037f0f2..8d4327f0b0c27 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -1533,10 +1533,20 @@ llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
         IRBuilder Builder(ExitBlock, ExitBlock->getFirstNonPHIIt());
         Builder.setFastMathFlags(Reductions.begin()->second.getFastMathFlags());
         RecurKind RK = Reductions.begin()->second.getRecurrenceKind();
+        // Each partial reduction starts at the recurrence's identity value
+        // and applies the loop's updates (including any subtractions)
+        // independently, so each partial result already carries the correct
+        // sign. The partial results must always be summed, even for sub-form
+        // recurrences; chaining them with the recurrence opcode (sub/fsub)
+        // would flip the sign of every other partial result.
+        unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK);
+        if (RdxOpcode == Instruction::Sub)
+          RdxOpcode = Instruction::Add;
+        else if (RdxOpcode == Instruction::FSub)
+          RdxOpcode = Instruction::FAdd;
         for (Instruction *RdxPart : drop_begin(PartialReductions)) {
-          RdxResult = Builder.CreateBinOp(
-              (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK),
-              RdxPart, RdxResult, "bin.rdx");
+          RdxResult = Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode,
+                                          RdxPart, RdxResult, "bin.rdx");
         }
         NeedToFixLCSSA = true;
         for (Instruction *RdxPart : PartialReductions)
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll b/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
index 840cc6c507c3d..40bfd86a52ef9 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
+++ b/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
@@ -484,6 +484,142 @@ exit:
   ret float %res
 }
 
+define i32 @test_sub_reduction(ptr %a, i64 %n) {
+;
+; CHECK-LABEL: define i32 @test_sub_reduction(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
+; CHECK:       [[ENTRY_NEW]]:
+; CHECK-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX_1:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[GEP_A]], align 2
+; CHECK-NEXT:    [[RDX_NEXT]] = sub i32 [[RDX]], [[TMP2]]
+; CHECK-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load i32, ptr [[GEP_A_1]], align 2
+; CHECK-NEXT:    [[RDX_NEXT_1]] = sub i32 [[RDX_1]], [[TMP3]]
+; CHECK-NEXT:    [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
+; CHECK-NEXT:    [[NITER_NEXT_1]] = add i64 [[NITER]], 2
+; CHECK-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
+; CHECK-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK:       [[EXIT_UNR_LCSSA]]:
+; CHECK-NEXT:    [[RES_PH:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX_UNR:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = add i32 [[RDX_NEXT_1]], [[RDX_NEXT]]
+; CHECK-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
+; CHECK:       [[LOOP_EPIL_PREHEADER]]:
+; CHECK-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[RDX_EPIL_INIT:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[LCMP_MOD2:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD2]])
+; CHECK-NEXT:    br label %[[LOOP_EPIL:.*]]
+; CHECK:       [[LOOP_EPIL]]:
+; CHECK-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_EPIL_INIT]]
+; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[GEP_A_EPIL]], align 2
+; CHECK-NEXT:    [[RDX_NEXT_EPIL:%.*]] = sub i32 [[RDX_EPIL_INIT]], [[TMP4]]
+; CHECK-NEXT:    br label %[[EXIT]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ], [ [[RDX_NEXT_EPIL]], %[[LOOP_EPIL]] ]
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %rdx = phi i32 [ 0, %entry ], [ %rdx.next, %loop ]
+  %gep.a = getelementptr inbounds nuw i32, ptr %a, i64 %iv
+  %1 = load i32, ptr %gep.a, align 2
+  %rdx.next = sub i32 %rdx, %1
+  %iv.next = add nuw nsw i64 %iv, 1
+  %ec = icmp eq i64 %iv.next, %n
+  br i1 %ec, label %exit, label %loop, !llvm.loop !0
+
+exit:
+  %res = phi i32 [ %rdx.next, %loop ]
+  ret i32 %res
+}
+
+define float @test_fsub_reduction(ptr %a, i64 %n) {
+;
+; CHECK-LABEL: define float @test_fsub_reduction(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT:    [[XTRAITER:%.*]] = and i64 [[N]], 1
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
+; CHECK:       [[ENTRY_NEW]]:
+; CHECK-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX_1:%.*]] = phi float [ -0.000000e+00, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX:%.*]] = phi float [ 0.000000e+00, %[[ENTRY_NEW]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = load float, ptr [[GEP_A]], align 16
+; CHECK-NEXT:    [[RDX_NEXT]] = fsub reassoc float [[RDX]], [[TMP2]]
+; CHECK-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV_NEXT]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load float, ptr [[GEP_A_1]], align 16
+; CHECK-NEXT:    [[RDX_NEXT_1]] = fsub reassoc float [[RDX_1]], [[TMP3]]
+; CHECK-NEXT:    [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
+; CHECK-NEXT:    [[NITER_NEXT_1]] = add i64 [[NITER]], 2
+; CHECK-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
+; CHECK-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK:       [[EXIT_UNR_LCSSA]]:
+; CHECK-NEXT:    [[RES_PH:%.*]] = phi float [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RDX_UNR:%.*]] = phi float [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = fadd reassoc float [[RDX_NEXT_1]], [[RDX_NEXT]]
+; CHECK-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
+; CHECK:       [[LOOP_EPIL_PREHEADER]]:
+; CHECK-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[RDX_EPIL_INIT:%.*]] = phi float [ 0.000000e+00, %[[ENTRY]] ], [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[LCMP_MOD2:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD2]])
+; CHECK-NEXT:    br label %[[LOOP_EPIL:.*]]
+; CHECK:       [[LOOP_EPIL]]:
+; CHECK-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV_EPIL_INIT]]
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, ptr [[GEP_A_EPIL]], align 16
+; CHECK-NEXT:    [[RDX_NEXT_EPIL:%.*]] = fsub reassoc float [[RDX_EPIL_INIT]], [[TMP4]]
+; CHECK-NEXT:    br label %[[EXIT]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi float [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ], [ [[RDX_NEXT_EPIL]], %[[LOOP_EPIL]] ]
+; CHECK-NEXT:    ret float [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %rdx = phi float [ 0.0, %entry ], [ %rdx.next, %loop ]
+  %gep.a = getelementptr inbounds nuw float, ptr %a, i64 %iv
+  %1 = load float, ptr %gep.a, align 16
+  %rdx.next = fsub reassoc float %rdx, %1
+  %iv.next = add nuw nsw i64 %iv, 1
+  %ec = icmp eq i64 %iv.next, %n
+  br i1 %ec, label %exit, label %loop, !llvm.loop !0
+
+exit:
+  %res = phi float [ %rdx.next, %loop ]
+  ret float %res
+}
+
 !0 = distinct !{!0, !1}
 !1 = !{!"llvm.loop.unroll.count", i32 2}
 
@@ -500,4 +636,6 @@ exit:
 ; CHECK: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]]}
 ; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]]}
 ; CHECK: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]]}
+; CHECK: [[LOOP9]] = distinct !{[[LOOP9]], [[META1]]}
+; CHECK: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]]}
 ;.

Add tests for unrolling loops with sub and fsub reductions when adding
parallel reduction phis. The final reduction value is currently computed
by chaining the partial results with the recurrence opcode (sub/fsub),
which flips the sign of every other partial result, e.g. computing
p1 - p0 instead of p1 + p0 for an unroll count of 2. The autogenerated
checks capture the current incorrect combines; a follow-up fixes them.

Assisted-by: Claude Code
@MarshallOfSound MarshallOfSound force-pushed the unroll-sub-reduction-combine-fix branch from ad3d0eb to 4a073bf Compare June 2, 2026 09:08
Comment on lines +1539 to +1543
unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK);
if (RdxOpcode == Instruction::Sub)
RdxOpcode = Instruction::Add;
else if (RdxOpcode == Instruction::FSub)
RdxOpcode = Instruction::FAdd;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use something like

          Instruction::BinaryOps Opcode =
              RecurrenceDescriptor::isSubRecurrenceKind(RK)
                  ? getSubRecurOpcode(RK)
                  : (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In LV, there's also the following, more compact comment:

           // For sub-recurrences, each part's reduction variable is already 
          // negative, we need to do: reduce.add(-acc_uf0 + -acc_uf1)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done — matched the LV pattern and comment (adapted since the partials are scalars here)

When introducing parallel reduction phis during unrolling, each partial
accumulator starts at the recurrence's identity value and applies the
loop's updates (including any subtractions) independently, so each
partial result already carries the correct sign. The final result must
therefore always sum the partial results.

The combine currently uses the opcode of the recurrence kind, which is
sub for RecurKind::Sub and fsub for RecurKind::FSub. Chaining the
partial results with sub computes pN - (... - (p1 - p0)), flipping the
sign of every other partial result and miscompiling the reduction.

Observed in the wild on AArch64 (Apple Silicon) release builds using
ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting
bytes >= 0x80, as in simdutf's arm64 utf8_length_from_latin1) is hot
enough to be unrolled with parallel reduction phis.

Fixes llvm#201065

Assisted-by: Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[LoopUnroll] Parallel reduction phis miscompile sub-form reductions: partials recombined with alternating signs

2 participants