ARM64: Optimize IndexOfWhereAllBitsSet when input is 0/AllBitsSet per element #126790
Open
EgorBo wants to merge 4 commits into dotnet:main
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR improves CoreCLR JIT ARM64 codegen for Vector128.IndexOfWhereAllBitsSet, targeting the common pattern where compare results are vectors of 0 / AllBitsSet and the goal is to quickly find the first matching lane index.
Changes:
- Add ARM64-specific import support for Vector128.IndexOfWhereAllBitsSet and a Rationalizer rewrite that emits a faster SHRN+CTZ-style sequence when the input is proven to be 0/AllBitsSet.
- Factor ARM64 ExtractMostSignificantBits lowering into a reusable helper (ExpandExtractMostSignificantBitsArm64) so it can be shared by both ExtractMostSignificantBits and the generic fallback for IndexOfWhereAllBitsSet.
- Add a VN-based fold hook to mark IndexOfWhereAllBitsSet inputs as 0/AllBitsSet (via a new GTF_HW_INPUT_ZERO_OR_ALLBITS flag), enabling the optimized path.
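
To make the SHRN+CTZ idea concrete, here is a minimal C# sketch of the commonly used nibble-mask trick for byte lanes. It is an illustration only, not the sequence this PR emits from the JIT: the class and method names (NibbleMaskSketch, IndexOfFirstSetLane) are hypothetical, and the sketch assumes every byte lane is exactly 0x00 or 0xFF, as produced by Vector128.Equals. On hardware without AdvSimd it falls back to a scalar emulation that builds the same 64-bit mask.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper for illustration; not part of the PR.
public static class NibbleMaskSketch
{
    // Precondition (assumed, not checked): each byte lane of cmp is 0x00 or 0xFF.
    public static int IndexOfFirstSetLane(Vector128<byte> cmp)
    {
        ulong mask;
        if (AdvSimd.IsSupported)
        {
            // SHRN #4 over 8 ushort lanes keeps bits 4..11 of each lane,
            // so every original byte lane contributes one nibble to a 64-bit mask.
            mask = AdvSimd.ShiftRightLogicalNarrowingLower(cmp.AsUInt16(), 4)
                          .AsUInt64()
                          .ToScalar();
        }
        else
        {
            // Portable emulation: nibble i of the mask is 0xF when byte lane i is 0xFF.
            mask = 0;
            for (int i = 0; i < Vector128<byte>.Count; i++)
            {
                if (cmp.GetElement(i) == 0xFF)
                {
                    mask |= 0xFUL << (i * 4);
                }
            }
        }

        // Four mask bits per lane, so the lane index is the trailing zero count divided by 4.
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask) >> 2;
    }
}
```

The sketch relies on the 0/AllBitsSet precondition: with arbitrary lane values the nibbles would no longer be all-or-nothing and the trailing zero count could not be mapped back to a lane index by a plain shift. That is why the overview above describes the fast path as applying only when the input is proven to be 0/AllBitsSet.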
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/coreclr/jit/rationalize.h | Declares new ARM64 Rationalizer helpers for the rewrite/expansion. |
| src/coreclr/jit/rationalize.cpp | Implements the ARM64 rewrite for IndexOfWhereAllBitsSet and refactors EMSB expansion into a shared helper. |
| src/coreclr/jit/hwintrinsiclistarm64.h | Registers Vector128.IndexOfWhereAllBitsSet as an ARM64 helper intrinsic (special import, no direct codegen). |
| src/coreclr/jit/hwintrinsicarm64.cpp | Adds ARM64 special import for NI_Vector128_IndexOfWhereAllBitsSet (integral only; float/double fall back to managed). |
| src/coreclr/jit/gentree.h | Introduces GTF_HW_INPUT_ZERO_OR_ALLBITS for HWIntrinsic nodes. |
| src/coreclr/jit/compiler.h | Adds optVNBasedFoldExpr_HWIntrinsic declaration (guarded by HW intrinsics). |
| src/coreclr/jit/assertionprop.cpp | Adds VN-based folding to set GTF_HW_INPUT_ZERO_OR_ALLBITS when the input tree is provably 0/AllBitsSet. |
@EgorBot -arm -azure_arm -aws_arm

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
    }

    byte[] data = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14];

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(ref MemoryMarshal.GetReference(data), 13, 14);

    [Benchmark]
    public int LastIndexOfBench() => LastIndexOf(ref MemoryMarshal.GetReference(data), 13);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(ref byte haystack, byte needle1, byte needle2)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(ref byte haystack, byte needle1)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.IndexOfWhereAllBitsSet(cmp1);
    }
}
```
2d6828a to 3e7ca22
Resolved review threads on src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs
@EgorBot -arm -aws_arm -azure_arm

```cs
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    byte[] data1 = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14];
    ushort[] data2 = [0,0,0,11,0,0,0,12];
    byte[] data3 = new byte[1024];
    byte needle1 = 11;

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(data2, 13, 14);

    [Benchmark]
    public int IndexOfLoopBench()
    {
        Span<byte> span = data3;
        var val = Vector128.Create((byte)111);
        for (int i = 0; i < span.Length; i += 16)
        {
            Vector128<byte> data = Vector128.Create(span);
            Vector128<byte> cmp = Vector128.Equals(data, val);
            int idx = Vector128.IndexOfWhereAllBitsSet(cmp);
            if (idx != -1)
            {
                return idx + i;
            }
        }
        return -1;
    }

    [Benchmark]
    public int LastIndexOfBench_Found() => LastIndexOf(data1, needle1);

    [Benchmark]
    public int LastIndexOfBench_NotFound() => LastIndexOf(data1, 42);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(Span<ushort> span, ushort needle1, ushort needle2)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(Span<byte> span, byte needle1)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.LastIndexOfWhereAllBitsSet(cmp1);
    }
}
```
e6d4f07 to df819a9
In many places within the BCL, we need to obtain the exact index of the first match resulting from vector comparison operations. This is relatively cheap on x64 thanks to the movemask instruction; however, fully emulating movemask on ARM64 is quite slow, necessitating the use of alternative optimizations.
Example:
Current codegen:
New codegen: