[LV] Disable fold tail by masking - when induction vars used outside#81609
@llvm/pr-subscribers-llvm-transforms
Author: Niwin Anto (niwinanto)
Changes
When induction variables are used outside the loop body, tail folding by masking miscompiles.
Full diff: https://github.com/llvm/llvm-project/pull/81609.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 37a356c43e29a4..d33743e74cbe31 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1552,6 +1552,19 @@ bool LoopVectorizationLegality::prepareToFoldTailByMasking() {
}
}
+ for (const auto &Entry : getInductionVars()) {
+ PHINode *OrigPhi = Entry.first;
+ for (User *U : OrigPhi->users()) {
+ auto *UI = cast<Instruction>(U);
+ if (!TheLoop->contains(UI)) {
+ LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking, loop IV has an "
+ "outside user for "
+ << *UI << "\n");
+ return false;
+ }
+ }
+ }
+
// The list of pointers that we can safely read and write to remains empty.
SmallPtrSet<Value *, 8> SafePointers;
diff --git a/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll b/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll
new file mode 100644
index 00000000000000..f7379df934bd77
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/no-fold-tail-by-masking-iv-external-uses.ll
@@ -0,0 +1,85 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -passes=loop-vectorize -S | FileCheck %s
+
+
+; #include <stdio.h>
+; #define SIZE 17
+;
+; unsigned char result;
+; unsigned char arr_1[SIZE];
+;
+; __attribute__((__noinline__))
+; void test(int limit, unsigned char val, int arr_2[SIZE][SIZE][SIZE]) {
+; #pragma clang loop vectorize_predicate(enable)
+; for (short i_5 = 0; i_5 < limit; i_5++) {
+; arr_1 [i_5] = val;
+; result = arr_2[0][0][i_5] != arr_2[i_5][i_5][0];
+; }
+; }
+;
+;int main(void) {
+; int arr_2[SIZE][SIZE][SIZE];
+;
+; __builtin_memset(arr_2, 1, sizeof(arr_2));
+;
+; test(SIZE, 0, arr_2);
+; printf("%hu \n", result);
+;}
+; clang miscompiles the above code:
+; with vectorize_predicate(enable), result is 0; without it, result is 1.
+
+
+@result = global i8 0, align 1
+@arr_17 = global [17 x i8] zeroinitializer, align 1
+@a = external global i8, align 1
+
+define void @test(i32 %limit, i8 zeroext %val, ptr readonly %arr_14) {
+; CHECK-LABEL: @test(
+; CHECK-NOT: pred.store.if:
+; CHECK-NOT: pred.store.continue:
+;
+entry:
+ %cmp18 = icmp sgt i32 %limit, 0
+ br i1 %cmp18, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader: ; preds = %entry
+ br label %for.body
+
+for.cond.for.cond.cleanup_crit_edge: ; preds = %for.body
+ %conv20.lcssa = phi i32 [ %conv20, %for.body ]
+ %arrayidx4 = getelementptr inbounds [17 x i32], ptr %arr_14, i32 0, i32 %conv20.lcssa
+ %0 = load i32, ptr %arrayidx4, align 4, !tbaa !4
+ %arrayidx8 = getelementptr inbounds [17 x [17 x i32]], ptr %arr_14, i32 %conv20.lcssa, i32 %conv20.lcssa
+ %1 = load i32, ptr %arrayidx8, align 4, !tbaa !4
+ %cmp10 = icmp ne i32 %0, %1
+ %conv11 = zext i1 %cmp10 to i8
+ store i8 %conv11, ptr @result, align 1, !tbaa !8
+ br label %for.cond.cleanup
+
+for.cond.cleanup: ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
+ ret void
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
+ %i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+ %arrayidx = getelementptr inbounds [17 x i8], ptr @arr_17, i32 0, i32 %conv20
+ store i8 %val, ptr %arrayidx, align 1, !tbaa !8
+ %inc = add i16 %i_5.019, 1
+ %conv = sext i16 %inc to i32
+ %cmp = icmp slt i32 %conv, %limit
+ br i1 %cmp, label %for.body, label %for.cond.for.cond.cleanup_crit_edge, !llvm.loop !9
+}
+
+
+
+!4 = !{!5, !5, i64 0}
+!5 = !{!"int", !6, i64 0}
+!6 = !{!"omnipotent char", !7, i64 0}
+!7 = !{!"Simple C++ TBAA"}
+!8 = !{!6, !6, i64 0}
+!9 = distinct !{!9, !10, !11, !12, !13, !14}
+!10 = !{!"llvm.loop.mustprogress"}
+!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
+!12 = !{!"llvm.loop.vectorize.width", i32 2}
+!13 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
+!14 = !{!"llvm.loop.vectorize.enable", i1 true}
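The legality check added in the first hunk can be modeled outside of LLVM. The following is a minimal sketch only: `ToyInst`, `ToyLoop` and `canFoldTailByMasking` are hypothetical stand-ins invented for illustration, not LLVM's `Instruction`, `Loop` or `prepareToFoldTailByMasking`. It shows the shape of the rule: if any induction phi has a user that is not contained in the loop, tail folding is rejected.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an IR instruction; not llvm::Instruction.
struct ToyInst {
  std::string Name;
  std::vector<const ToyInst *> Users; // instructions that read this value
};

// Hypothetical stand-in for a loop; not llvm::Loop.
struct ToyLoop {
  std::unordered_set<const ToyInst *> Body;
  bool contains(const ToyInst *I) const { return Body.count(I) != 0; }
};

// Returns false as soon as one induction phi has a user outside the loop,
// mirroring the structure of the check in the patch.
bool canFoldTailByMasking(const ToyLoop &L,
                          const std::vector<const ToyInst *> &InductionPhis) {
  for (const ToyInst *Phi : InductionPhis) {
    assert(Phi && "null induction phi");
    for (const ToyInst *U : Phi->Users)
      if (!L.contains(U))
        return false; // outside user found: do not fold the tail
  }
  return true;
}
```

In the test above, `%conv20` is read by `%conv20.lcssa` in the exit block, so a model of that loop would be rejected, while a loop whose induction phi is only read by its own increment would still be accepted.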
fhahn left a comment:
Thanks for the patch!
Could you add the test as a separate PR (with a FIXME)? This patch would then just adjust the test, and the diff would show only the change in the test.
Previously, a patch was shared at https://reviews.llvm.org/D115109 by @rickyz (hope it's the same handle as on Phabricator), but it never got pushed through. It would be good to look at the comments there and potentially pick it up.
br label %for.body

for.cond.for.cond.cleanup_crit_edge: ; preds = %for.body
%conv20.lcssa = phi i32 [ %conv20, %for.body ]
I think the test can be simplified by just returning %conv20.lcssa here
ret void

for.body: ; preds = %for.body.preheader, %for.body
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
Does the issue reproduce if all uses of %conv20 are replaced by i_5.019?

for.body: ; preds = %for.body.preheader, %for.body
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
%i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
Can the phi be changed to i32, so the sext in the loop isn't needed?
%conv20 = phi i32 [ %conv, %for.body ], [ 0, %for.body.preheader ]
%i_5.019 = phi i16 [ %inc, %for.body ], [ 0, %for.body.preheader ]
%arrayidx = getelementptr inbounds [17 x i8], ptr @arr_17, i32 0, i32 %conv20
store i8 %val, ptr %arrayidx, align 1, !tbaa !8

; CHECK-NOT: pred.store.continue:
;
entry:
%cmp18 = icmp sgt i32 %limit, 0
nit: the check and branch shouldn't be needed.
;int main(void) {
; int arr_2[SIZE][SIZE][SIZE];
;
; __builtin_memset(arr_2, 1, sizeof(arr_2));
Usually we don't include C/C++ source code, as the IR usually needs to stand on its own. Below are a few suggestions to further simplify the IR and make it more readable.
It would be helpful if you could instead add a brief comment explaining the issue.

define void @test(i32 %limit, i8 zeroext %val, ptr readonly %arr_14) {
; CHECK-LABEL: @test(
; CHECK-NOT: pred.store.if:
This is quite fragile; some existing tests use CHECK-NOT: vector.body: to check for not vectorizing.

!4 = !{!5, !5, i64 0}
nodes used by tbaa shouldn't be needed after dropping !tbaa
@@ -0,0 +1,85 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -passes=loop-vectorize -S | FileCheck %s
As this is added as a target-independent test, it probably needs something like -force-vector-width=4 -force-vector-interleave=1 to make sure the vectorizer tries to vectorize independent of the cost-model.
Thanks @fhahn for the reviews. Great that you mentioned the Phabricator patch; the test looks good and I copied it here. As you suggested, I created a new PR for the test case with the default behavior (niwinanto@33ec308) and then updated this PR. However, I think I messed up the git workflow. Could you please take a look and check whether this is what you intended?

Thank you @niwinanto for picking this up (and apologies for letting the change languish for so long despite @fhahn's helpful comments!)

Yeah that looks good, I'll add a few small additional comments. But best to create a separate PR to just add the test case showing the issue first.

@fhahn I am trying to create exactly that separate PR: niwinanto#2. Maybe you can help me figure out what I am doing wrong. I am extremely sorry, I am still getting used to the new workflow. As you suggested, I created a new commit on a different branch and created a new PR (for the test, as mentioned above). For some reason it contains the commit from this PR, which I tried to remove by dropping it in an interactive rebase and force-pushing. I also addressed the feedback regarding the tests.

Looking at https://github.com/niwinanto/llvm-project/pull/2/commits, it looks like there's a single commit adding the test, so that looks good I think? Could you update the destination branch to be upstream llvm-project's

@fhahn Updated the PR to adjust the changes after merging the test early.
fhahn left a comment:
LGTM, thanks!
I adjusted the description of the PR a bit to add a few more details.
@niwinanto Congratulations on having your first Pull Request (PR) merged into the LLVM Project!
#149042 added last-active-lane and removed the restriction that we couldn't tail fold loops that had outside users (in AllowedExit). However, we still have a restriction that IVs can't have outside users. This was added separately to the AllowedExit restriction in #81609, but it looks like #149042 didn't remove it. AFAICT we currently extract the correct lane for IVs, so this PR relaxes the restriction.

This helps a good few loops get tail folded in llvm-test-suite.

-force-tail-folding-style=none was added to pr5881-scev-expansion.ll to preserve the original SCEV expansion, since otherwise we end up with a cttz.elts(false, false, true, true) that blocks SCEV analysis. We should probably teach ConstantFolding to fold it.
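To make the last-active-lane idea above concrete, here is a small scalar model in C++. It is only a sketch under stated assumptions: the induction starts at 0 with step 1, a lane is active when its index is below the trip count, and the name `lastActiveLaneIV` is made up for this example. It does not reproduce LLVM's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a tail-folded loop (assumed IV: start 0, step 1).
// Each vector iteration covers VF consecutive IV values; a lane is active
// only while its IV value is below TripCount. The value an outside user
// should see is the IV held by the last active lane of the final iteration.
int64_t lastActiveLaneIV(int64_t TripCount, int64_t VF) {
  assert(TripCount > 0 && VF > 0);
  int64_t Result = -1;
  for (int64_t Base = 0; Base < TripCount; Base += VF) {
    for (int64_t Lane = 0; Lane < VF; ++Lane) {
      int64_t IV = Base + Lane;
      bool Active = IV < TripCount; // the mask bit for this lane
      if (Active)
        Result = IV; // remember the most recent active lane
    }
  }
  return Result;
}
```

For a trip count of 17 and a vector width of 2, the final vector iteration covers IV values 16 and 17, only the first lane is active, and the model returns 16, which matches what the scalar loop would leave in the phi.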
When induction variables are used outside the loop body, tail folding
by masking miscompiles, because for users outside of the loop the
final value of the induction is computed separately from the vector
loop.
Fixes #76069
Fixes #51677
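
The mismatch described in the commit message can be illustrated with a toy model. This sketch assumes, purely for illustration, that the separately computed final value is derived from the trip count rounded up to a multiple of the vector width, which is what a tail-folded vector loop iterates over. The helper names are hypothetical and the formula is a simplification, not a copy of LLVM's code.

```cpp
#include <cassert>
#include <cstdint>

// Round N up to the next multiple of Multiple.
int64_t roundUpTo(int64_t N, int64_t Multiple) {
  assert(N >= 0 && Multiple > 0);
  return (N + Multiple - 1) / Multiple * Multiple;
}

// Reference: value of the IV phi in the last iteration of
// for (i = 0; i < TripCount; ++i).
int64_t scalarLastPhiValue(int64_t TripCount) {
  assert(TripCount > 0);
  int64_t Last = 0;
  for (int64_t I = 0; I < TripCount; ++I)
    Last = I;
  return Last;
}

// Assumed model of the problematic path: the final value is computed from
// the rounded-up trip count instead of from the lanes that really executed.
int64_t foldedLastPhiValue(int64_t TripCount, int64_t VF) {
  assert(TripCount > 0);
  return roundUpTo(TripCount, VF) - 1;
}
```

With a trip count of 17 and a width of 2, as in the test, the scalar loop leaves 16 in the phi while the model yields 17. When the trip count is already a multiple of the width (for example 16), the two agree, which is why the problem only shows up when there is a tail to fold.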