[AArch64][SVE] Detect MOV (imm, pred, zeroing/merging) #116032
Conversation
Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred, merging) or MOV (imm, pred, zeroing) as appropriate.
@llvm/pr-subscribers-backend-aarch64
Author: Ricardo Jesus (rj-jesus)
Changes: Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred, merging) or MOV (imm, pred, zeroing) as appropriate.
Full diff: https://github.com/llvm/llvm-project/pull/116032.diff — 2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index c10653e05841cd..d0b4b71a93f641 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -892,6 +892,26 @@ let Predicates = [HasSVEorSME] in {
def : Pat<(nxv2i64 (splat_vector (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)))),
(DUP_ZI_D $a, $b)>;
+ // Duplicate Int immediate to active vector elements (zeroing).
+ def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_B $pg, $a, $b)>;
+ def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_H $pg, $a, $b)>;
+ def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_S $pg, $a, $b)>;
+ def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+ (CPY_ZPzI_D $pg, $a, $b)>;
+
+ // Duplicate Int immediate to active vector elements (merging).
+ def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (nxv16i8 ZPR:$z))),
+ (CPY_ZPmI_B $z, $pg, $a, $b)>;
+ def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (nxv8i16 ZPR:$z))),
+ (CPY_ZPmI_H $z, $pg, $a, $b)>;
+ def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (nxv4i32 ZPR:$z))),
+ (CPY_ZPmI_S $z, $pg, $a, $b)>;
+ def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (nxv2i64 ZPR:$z))),
+ (CPY_ZPmI_D $z, $pg, $a, $b)>;
+
// Duplicate immediate FP into all vector elements.
def : Pat<(nxv2f16 (splat_vector (f16 fpimm:$val))),
(DUP_ZR_H (MOVi32imm (bitcast_fpimm_to_i32 f16:$val)))>;
diff --git a/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
new file mode 100644
index 00000000000000..43be70c9590fb0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Zeroing.
+
+define dso_local <vscale x 16 x i8> @mov_z_b(<vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_z_b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %pg, i8 1)
+ ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_z_h(<vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_z_h:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> %pg, i16 1)
+ ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_z_s(<vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_z_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> %pg, i32 1)
+ ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_z_d(<vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_z_d:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> %pg, i64 1)
+ ret <vscale x 2 x i64> %r
+}
+
+; Merging.
+
+define dso_local <vscale x 16 x i8> @mov_m_b(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_m_b:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.b, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg, i8 1)
+ ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_m_h(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_m_h:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.h, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg, i16 1)
+ ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_m_s(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_m_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.s, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg, i32 1)
+ ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_m_d(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_m_d:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z0.d, p0/m, #1 // =0x1
+; CHECK-NEXT: ret
+ %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg, i64 1)
+ ret <vscale x 2 x i64> %r
+}
+
+declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
+declare <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i1>, i16)
+declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i1>, i32)
+declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i1>, i64)
paulwalker-arm
left a comment
Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.
The uses are coming from ACLE intrinsics, as in the description above. Would you rather see it done with a select instead? We could probably address all these cases by going the select way. What do you think?
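For context, a hypothetical sketch (not taken from the patch) of the select-based lowering being discussed: the predicated dup intrinsic with an immediate operand is semantically a select between a splat of the immediate and the passthru vector, which generic combines already know how to fold.

```llvm
; Hypothetical select form of a predicated dup of #1 (zeroing would pass
; zeroinitializer as the passthru). Illustrative only.
define <vscale x 16 x i8> @dup_as_select(<vscale x 16 x i1> %pg, <vscale x 16 x i8> %passthru) {
  %r = select <vscale x 16 x i1> %pg, <vscale x 16 x i8> splat (i8 1), <vscale x 16 x i8> %passthru
  ret <vscale x 16 x i8> %r
}
```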
Thanks for the clarification. I just wasn't sure if the example was a real reproducer or just a means to demonstrate the problem. Given this is ACLE related, let's park the decision to replace the intrinsic with a select for now. The missed optimisation of the all-active case isn't too surprising given there's already an unpredicated builtin that should be used instead. That said, I see no harm in updating that case too.
Thanks for the feedback, I've now pushed the patterns into the suggested location. As for the unpredicated case, does it make sense to make that change here as well?
paulwalker-arm
left a comment
A minor quibble with the tests but otherwise this looks good to me.
I recommend fixing it as a separate PR.
Sounds good, and many thanks for the feedback!
Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.
This affects the @llvm.aarch64.sve.dup intrinsics, which currently generate the MOV (scalar, predicated) instructions even when the immediate forms are possible. For example:
Currently generates:
Instead of:
https://godbolt.org/z/GPzrW1nY8
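As a sketch of the codegen difference (register numbers are illustrative, based on the test expectations above): the old lowering materialises the immediate into a scalar register and then uses the predicated scalar MOV, whereas the new patterns select the immediate form directly.

```asm
// Before: MOV (scalar, predicated) — immediate goes through a GPR
mov w8, #1
mov z0.b, p0/z, w8

// After: MOV (imm, pred, zeroing) — single instruction, no GPR needed
mov z0.b, p0/z, #1      // =0x1
```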