[LoopUnroll] Combine partial sub/fsub reduction results with add/fadd#201066
[LoopUnroll] Combine partial sub/fsub reduction results with add/fadd#201066MarshallOfSound wants to merge 2 commits into
Conversation
|
cc @fhahn @juliannagele — would appreciate a review, this fixes a miscompile in the parallel reduction phi combining from #149470 / #166630. |
|
Hello @MarshallOfSound 👋 Thank you for submitting a Pull Request (PR) to the LLVM Project. Since this is your first PR, here are a few useful links covering our main contribution policies and review practices.
Please reply to this message to confirm that you have read these policies, especially the LLVM AI Tool Use Policy, and that any AI tool usage has been noted in the PR description. Frequently asked questionsHow do I add reviewers? This PR will be automatically labeled, and the relevant teams will be notified. For some parts of the project, reviewers may also be added automatically. You can also add reviewers manually using the Reviewers section on this page. If you cannot use that section, it is probably because you do not have write permissions for the repository. In that case, you can request a review by tagging reviewers in a comment using What if there are no comments? If you have not received any comments on your PR after a week, you can request a review by pinging the PR with a comment such as “Ping”. The common courtesy ping rate is once a week. Please remember that you are asking for volunteer time from other developers. Are any special GitHub settings required to contribute to LLVM? We only require contributors to have a public email address associated with their GitHub commits, see this section of LLVM Developer Policy for details. If you have questions, feel free to leave a comment on this PR, or ask on LLVM Discord or LLVM Discourse. Thank you, |
|
@llvm/pr-subscribers-llvm-transforms Author: Samuel Attard (MarshallOfSound) ChangesWhen introducing parallel reduction phis during unrolling (#149470), each partial accumulator starts at the recurrence's identity value and applies the loop's updates (including any subtractions) independently, so each partial result already carries the correct sign. The final result must therefore always sum the partial results. The combine currently uses the opcode of the recurrence kind, which is sub for Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting bytes >= 0x80, as in simdutf's arm64 The first commit adds tests for sub/fsub reductions showing the current incorrect combines; the second applies the fix. Fixes #201065 Full diff: https://github.com/llvm/llvm-project/pull/201066.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index ffb4d1037f0f2..8d4327f0b0c27 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -1533,10 +1533,20 @@ llvm::UnrollLoop(Loop *L, UnrollLoopOptions ULO, LoopInfo *LI,
IRBuilder Builder(ExitBlock, ExitBlock->getFirstNonPHIIt());
Builder.setFastMathFlags(Reductions.begin()->second.getFastMathFlags());
RecurKind RK = Reductions.begin()->second.getRecurrenceKind();
+ // Each partial reduction starts at the recurrence's identity value
+ // and applies the loop's updates (including any subtractions)
+ // independently, so each partial result already carries the correct
+ // sign. The partial results must always be summed, even for sub-form
+ // recurrences; chaining them with the recurrence opcode (sub/fsub)
+ // would flip the sign of every other partial result.
+ unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK);
+ if (RdxOpcode == Instruction::Sub)
+ RdxOpcode = Instruction::Add;
+ else if (RdxOpcode == Instruction::FSub)
+ RdxOpcode = Instruction::FAdd;
for (Instruction *RdxPart : drop_begin(PartialReductions)) {
- RdxResult = Builder.CreateBinOp(
- (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK),
- RdxPart, RdxResult, "bin.rdx");
+ RdxResult = Builder.CreateBinOp((Instruction::BinaryOps)RdxOpcode,
+ RdxPart, RdxResult, "bin.rdx");
}
NeedToFixLCSSA = true;
for (Instruction *RdxPart : PartialReductions)
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll b/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
index 840cc6c507c3d..40bfd86a52ef9 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
+++ b/llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll
@@ -484,6 +484,142 @@ exit:
ret float %res
}
+define i32 @test_sub_reduction(ptr %a, i64 %n) {
+;
+; CHECK-LABEL: define i32 @test_sub_reduction(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT: [[XTRAITER:%.*]] = and i64 [[N]], 1
+; CHECK-NEXT: [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
+; CHECK-NEXT: br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
+; CHECK: [[ENTRY_NEW]]:
+; CHECK-NEXT: [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX_1:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[GEP_A]], align 2
+; CHECK-NEXT: [[RDX_NEXT]] = sub i32 [[RDX]], [[TMP2]]
+; CHECK-NEXT: [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
+; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr [[GEP_A_1]], align 2
+; CHECK-NEXT: [[RDX_NEXT_1]] = sub i32 [[RDX_1]], [[TMP3]]
+; CHECK-NEXT: [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
+; CHECK-NEXT: [[NITER_NEXT_1]] = add i64 [[NITER]], 2
+; CHECK-NEXT: [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
+; CHECK-NEXT: br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK: [[EXIT_UNR_LCSSA]]:
+; CHECK-NEXT: [[RES_PH:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX_UNR:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[BIN_RDX:%.*]] = add i32 [[RDX_NEXT_1]], [[RDX_NEXT]]
+; CHECK-NEXT: [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT: br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
+; CHECK: [[LOOP_EPIL_PREHEADER]]:
+; CHECK-NEXT: [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT: [[RDX_EPIL_INIT:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT: [[LCMP_MOD2:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT: call void @llvm.assume(i1 [[LCMP_MOD2]])
+; CHECK-NEXT: br label %[[LOOP_EPIL:.*]]
+; CHECK: [[LOOP_EPIL]]:
+; CHECK-NEXT: [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_EPIL_INIT]]
+; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[GEP_A_EPIL]], align 2
+; CHECK-NEXT: [[RDX_NEXT_EPIL:%.*]] = sub i32 [[RDX_EPIL_INIT]], [[TMP4]]
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[RES:%.*]] = phi i32 [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ], [ [[RDX_NEXT_EPIL]], %[[LOOP_EPIL]] ]
+; CHECK-NEXT: ret i32 [[RES]]
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %rdx = phi i32 [ 0, %entry ], [ %rdx.next, %loop ]
+ %gep.a = getelementptr inbounds nuw i32, ptr %a, i64 %iv
+ %1 = load i32, ptr %gep.a, align 2
+ %rdx.next = sub i32 %rdx, %1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %ec = icmp eq i64 %iv.next, %n
+ br i1 %ec, label %exit, label %loop, !llvm.loop !0
+
+exit:
+ %res = phi i32 [ %rdx.next, %loop ]
+ ret i32 %res
+}
+
+define float @test_fsub_reduction(ptr %a, i64 %n) {
+;
+; CHECK-LABEL: define float @test_fsub_reduction(
+; CHECK-SAME: ptr [[A:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT: [[XTRAITER:%.*]] = and i64 [[N]], 1
+; CHECK-NEXT: [[TMP1:%.*]] = icmp ult i64 [[TMP0]], 1
+; CHECK-NEXT: br i1 [[TMP1]], label %[[LOOP_EPIL_PREHEADER:.*]], label %[[ENTRY_NEW:.*]]
+; CHECK: [[ENTRY_NEW]]:
+; CHECK-NEXT: [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX_1:%.*]] = phi float [ -0.000000e+00, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX:%.*]] = phi float [ 0.000000e+00, %[[ENTRY_NEW]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[TMP2:%.*]] = load float, ptr [[GEP_A]], align 16
+; CHECK-NEXT: [[RDX_NEXT]] = fsub reassoc float [[RDX]], [[TMP2]]
+; CHECK-NEXT: [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV_NEXT]]
+; CHECK-NEXT: [[TMP3:%.*]] = load float, ptr [[GEP_A_1]], align 16
+; CHECK-NEXT: [[RDX_NEXT_1]] = fsub reassoc float [[RDX_1]], [[TMP3]]
+; CHECK-NEXT: [[IV_NEXT_1]] = add nuw nsw i64 [[IV]], 2
+; CHECK-NEXT: [[NITER_NEXT_1]] = add i64 [[NITER]], 2
+; CHECK-NEXT: [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
+; CHECK-NEXT: br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK: [[EXIT_UNR_LCSSA]]:
+; CHECK-NEXT: [[RES_PH:%.*]] = phi float [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[RDX_UNR:%.*]] = phi float [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd reassoc float [[RDX_NEXT_1]], [[RDX_NEXT]]
+; CHECK-NEXT: [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT: br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
+; CHECK: [[LOOP_EPIL_PREHEADER]]:
+; CHECK-NEXT: [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT: [[RDX_EPIL_INIT:%.*]] = phi float [ 0.000000e+00, %[[ENTRY]] ], [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ]
+; CHECK-NEXT: [[LCMP_MOD2:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT: call void @llvm.assume(i1 [[LCMP_MOD2]])
+; CHECK-NEXT: br label %[[LOOP_EPIL:.*]]
+; CHECK: [[LOOP_EPIL]]:
+; CHECK-NEXT: [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[IV_EPIL_INIT]]
+; CHECK-NEXT: [[TMP4:%.*]] = load float, ptr [[GEP_A_EPIL]], align 16
+; CHECK-NEXT: [[RDX_NEXT_EPIL:%.*]] = fsub reassoc float [[RDX_EPIL_INIT]], [[TMP4]]
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[RES:%.*]] = phi float [ [[BIN_RDX]], %[[EXIT_UNR_LCSSA]] ], [ [[RDX_NEXT_EPIL]], %[[LOOP_EPIL]] ]
+; CHECK-NEXT: ret float [[RES]]
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %rdx = phi float [ 0.0, %entry ], [ %rdx.next, %loop ]
+ %gep.a = getelementptr inbounds nuw float, ptr %a, i64 %iv
+ %1 = load float, ptr %gep.a, align 16
+ %rdx.next = fsub reassoc float %rdx, %1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %ec = icmp eq i64 %iv.next, %n
+ br i1 %ec, label %exit, label %loop, !llvm.loop !0
+
+exit:
+ %res = phi float [ %rdx.next, %loop ]
+ ret float %res
+}
+
!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.unroll.count", i32 2}
@@ -500,4 +636,6 @@ exit:
; CHECK: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]]}
; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]]}
; CHECK: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]]}
+; CHECK: [[LOOP9]] = distinct !{[[LOOP9]], [[META1]]}
+; CHECK: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]]}
;.
|
Add tests for unrolling loops with sub and fsub reductions when adding parallel reduction phis. The final reduction value is currently computed by chaining the partial results with the recurrence opcode (sub/fsub), which flips the sign of every other partial result, e.g. computing p1 - p0 instead of p1 + p0 for an unroll count of 2. The autogenerated checks capture the current incorrect combines; a follow-up fixes them. Assisted-by: Claude Code
ad3d0eb to
4a073bf
Compare
| unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK); | ||
| if (RdxOpcode == Instruction::Sub) | ||
| RdxOpcode = Instruction::Add; | ||
| else if (RdxOpcode == Instruction::FSub) | ||
| RdxOpcode = Instruction::FAdd; |
There was a problem hiding this comment.
Better use something like
Instruction::BinaryOps Opcode =
RecurrenceDescriptor::isSubRecurrenceKind(RK)
? getSubRecurOpcode(RK)
: (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK);
There was a problem hiding this comment.
In LV, there's also the following, more compact comment:
// For sub-recurrences, each part's reduction variable is already
// negative, we need to do: reduce.add(-acc_uf0 + -acc_uf1)
There was a problem hiding this comment.
done — matched the LV pattern and comment (adapted since the partials are scalars here)
When introducing parallel reduction phis during unrolling, each partial accumulator starts at the recurrence's identity value and applies the loop's updates (including any subtractions) independently, so each partial result already carries the correct sign. The final result must therefore always sum the partial results. The combine currently uses the opcode of the recurrence kind, which is sub for RecurKind::Sub and fsub for RecurKind::FSub. Chaining the partial results with sub computes pN - (... - (p1 - p0)), flipping the sign of every other partial result and miscompiling the reduction. Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting bytes >= 0x80, as in simdutf's arm64 utf8_length_from_latin1) is hot enough to be unrolled with parallel reduction phis. Fixes llvm#201065 Assisted-by: Claude Code
4a073bf to
8a5f697
Compare
When introducing parallel reduction phis during unrolling (#149470), each partial accumulator starts at the recurrence's identity value and applies the loop's updates (including any subtractions) independently, so each partial result already carries the correct sign. The final result must therefore always sum the partial results.
The combine currently uses the opcode of the recurrence kind, which is sub for
RecurKind::Suband fsub forRecurKind::FSub(reachable via #166630 with reassoc). Chaining the partial results with sub computespN - (... - (p1 - p0)), flipping the sign of every other partial result and miscompiling the reduction.Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO and PGO, where a hand-written NEON sub-form reduction (counting bytes >= 0x80, as in simdutf's arm64
utf8_length_from_latin1) is hot enough to be unrolled with parallel reduction phis.The first commit adds tests for sub/fsub reductions showing the current incorrect combines; the second applies the fix.
Fixes #201065
Substantial parts of this patch were developed with the assistance of Claude Code (see the
Assisted-bytrailers) per the AI tool policy — I've reviewed all of it.