ARM64: Optimize IndexOfWhereAllBitsSet when input is 0/AllBitsSet per element #126790

Open
EgorBo wants to merge 4 commits into dotnet:main from EgorBo:optimize-IndexOfWhereAllBitsSet

@EgorBo
Member

@EgorBo EgorBo commented Apr 11, 2026

In many places within the BCL, we need to obtain the exact index of the first match resulting from vector comparison operations. This is relatively cheap on x64 thanks to the movemask instruction; however, fully emulating movemask on ARM64 is quite slow, necessitating the use of alternative optimizations.
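
To make the movemask-style approach concrete, here is a minimal sketch of how a first-match index can be derived from a per-lane bitmask. It relies on the portable `Vector128.ExtractMostSignificantBits` API, which on x64 typically lowers to a movemask instruction. The names `MoveMaskSketch` and `IndexViaMoveMask` are illustrative assumptions and are not taken from the BCL.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class MoveMaskSketch
{
    // Hypothetical helper: gathers the top bit of every byte lane into a
    // 16-bit mask, then reports the position of the lowest set bit.
    public static int IndexViaMoveMask(Vector128<byte> cmp)
    {
        uint mask = cmp.ExtractMostSignificantBits();
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }

    public static void Main()
    {
        Vector128<byte> data = Vector128.Create((byte)0, 0, 0, 11, 0, 0, 0, 12, 0, 0, 0, 13, 0, 0, 0, 14);
        Vector128<byte> cmp = Vector128.Equals(data, Vector128.Create((byte)13));
        Console.WriteLine(IndexViaMoveMask(cmp)); // 11
    }
}
```

The whole lookup is one mask extraction plus one trailing-zero count, which is why the cost is dominated by how cheaply the mask can be produced.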

Example:

static int IndexOfAny(ref byte haystack, byte needle1, byte needle2)
{
    var data = Vector128.LoadUnsafe(ref haystack);
    var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
    var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
    return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
}

Current codegen:

            ldr     q16, [x0]
            uxtb    w0, w1
            dup     v17.16b, w0
            cmeq    v17.16b, v16.16b, v17.16b
            uxtb    w0, w2
            dup     v18.16b, w0
            cmeq    v16.16b, v16.16b, v18.16b
            orr     v16.16b, v17.16b, v16.16b

            ;; old logic:
            mvni    v17.4s, #0
            cmeq    v16.16b, v16.16b, v17.16b
            movi    v17.16b, #0x80
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD00]
            ushl    v16.16b, v16.16b, v17.16b
            uxtl2   v17.8h, v16.16b
            shl     v17.8h, v17.8h, #8
            uaddw   v16.8h, v17.8h, v16.8b
            addv    h16, v16.8h
            umov    w0, v16.h[0]
            rbit    w0, w0
            clz     w0, w0

            movn    w1, #0
            cmp     w0, #32
            csel    w0, w0, w1, ne
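
The "old logic" portion of the listing can be read as an emulated movemask followed by a trailing-zero count. The scalar model below is a rough sketch of that reading, not the JIT's implementation. The contents of the `@RWD00` constant are not shown in the listing, so the per-lane shift amounts used here are an assumption, and the class name `EmulatedMoveMaskSketch` is hypothetical.

```csharp
using System;
using System.Numerics;

public static class EmulatedMoveMaskSketch
{
    public static int IndexOfWhereAllBitsSet(ReadOnlySpan<byte> lanes)
    {
        uint mask = 0;
        for (int i = 0; i < 16; i++)
        {
            // cmeq against AllBitsSet, then and with 0x80: keep the top bit of matching lanes.
            int top = lanes[i] == 0xFF ? 0x80 : 0;
            // ushl by an ASSUMED negative per-lane amount: lane i of each 8-byte half becomes (1 << i).
            int bit = top >> (7 - (i & 7));
            // uxtl2 + shl #8 + uaddw + addv: upper-half bits move to bits 8..15, then all lanes are summed.
            mask += (uint)(i < 8 ? bit : bit << 8);
        }

        // rbit + clz: trailing-zero count, which is 32 when the mask is empty.
        int tz = BitOperations.TrailingZeroCount(mask);
        // cmp #32 + csel: map "no match" to -1.
        return tz == 32 ? -1 : tz;
    }

    public static void Main()
    {
        byte[] lanes = new byte[16];
        lanes[11] = 0xFF;
        lanes[15] = 0xFF;
        Console.WriteLine(IndexOfWhereAllBitsSet(lanes));        // 11
        Console.WriteLine(IndexOfWhereAllBitsSet(new byte[16])); // -1
    }
}
```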

New codegen:

            ldr     q16, [x0]
            uxtb    w0, w1
            dup     v17.16b, w0
            cmeq    v17.16b, v16.16b, v17.16b
            uxtb    w0, w2
            dup     v18.16b, w0
            cmeq    v16.16b, v16.16b, v18.16b
            orr     v16.16b, v17.16b, v16.16b

            ;; new logic:
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x1, x0
            clz     x1, x1
            lsr     w1, w1, #2

            movn    w2, #0
            cmp     x0, #0
            csel    w0, w1, w2, ne
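
The "new logic" sequence can be modeled in scalar code as well. The sketch below assumes byte lanes that are each either 0x00 or 0xFF: `shrn #4` keeps one nibble per input byte, so the 128-bit mask collapses into 64 bits and the trailing-zero count divided by 4 is the lane index. The names `NibbleMaskSketch` and `NarrowToNibbleMask` are hypothetical; this is an illustration of the instruction sequence, not the JIT's code.

```csharp
using System;
using System.Numerics;

public static class NibbleMaskSketch
{
    // Scalar model of `shrn v16.8b, v16.8h, #4`: every 16-bit lane is shifted
    // right by 4 and truncated to 8 bits, so each input byte contributes 4 bits.
    public static ulong NarrowToNibbleMask(ReadOnlySpan<byte> lanes)
    {
        ulong result = 0;
        for (int lane = 0; lane < 8; lane++)
        {
            ushort pair = (ushort)(lanes[2 * lane] | (lanes[2 * lane + 1] << 8));
            byte narrowed = (byte)(pair >> 4);
            result |= (ulong)narrowed << (8 * lane);
        }
        return result;
    }

    // Model of rbit + clz (trailing-zero count), lsr #2 and the final csel.
    public static int IndexOfWhereAllBitsSet(ReadOnlySpan<byte> lanes)
    {
        ulong nibbles = NarrowToNibbleMask(lanes);
        return nibbles == 0 ? -1 : BitOperations.TrailingZeroCount(nibbles) >> 2;
    }

    public static void Main()
    {
        byte[] lanes = new byte[16];
        lanes[11] = 0xFF;
        lanes[15] = 0xFF;
        Console.WriteLine(IndexOfWhereAllBitsSet(lanes)); // 11

        // Input that violates the 0/AllBitsSet assumption: lane 2 is 0xF0.
        byte[] notAMask = new byte[16];
        notAMask[2] = 0xF0;
        Console.WriteLine(IndexOfWhereAllBitsSet(notAMask)); // 2, although lane 2 is not AllBitsSet
    }
}
```

The second call shows why the shorter sequence is only valid when every element is known to be 0 or AllBitsSet: a partially set lane still produces a non-zero nibble and is reported as a match.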

Copilot AI review requested due to automatic review settings April 11, 2026 15:10
@github-actions github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 11, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves CoreCLR JIT ARM64 codegen for Vector128.IndexOfWhereAllBitsSet, targeting the common pattern where compare results are vectors of 0 / AllBitsSet and the goal is to quickly find the first matching lane index.

Changes:

  • Add ARM64-specific import support for Vector128.IndexOfWhereAllBitsSet and a Rationalizer rewrite that emits a faster SHRN+CTZ-style sequence when the input is proven to be 0/AllBitsSet.
  • Factor ARM64 ExtractMostSignificantBits lowering into a reusable helper (ExpandExtractMostSignificantBitsArm64) so it can be shared by both ExtractMostSignificantBits and the generic fallback for IndexOfWhereAllBitsSet.
  • Add a VN-based fold hook to mark IndexOfWhereAllBitsSet inputs as 0/AllBitsSet (via a new GTF_HW_INPUT_ZERO_OR_ALLBITS flag), enabling the optimized path.
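
The idea of proving that an input is 0/AllBitsSet per element can be illustrated with a toy analysis over an expression tree: comparisons produce such masks, and bitwise combinations of masks remain masks. The sketch below is a simplified model under the assumption that all nodes share one element type. The types `Expr`, `Load`, `Compare`, `Bitwise`, `Not` and the method `IsZeroOrAllBitsPerElement` are hypothetical and do not correspond to the JIT's value-numbering code.

```csharp
using System;

// Hypothetical toy expression tree. These types only illustrate the reasoning
// and do not correspond to JIT data structures.
public abstract record Expr;
public sealed record Load(string Name) : Expr;                       // arbitrary vector data
public sealed record Compare(Expr Left, Expr Right) : Expr;          // element-wise comparison
public sealed record Bitwise(char Op, Expr Left, Expr Right) : Expr; // '&', '|' or '^'
public sealed record Not(Expr Operand) : Expr;                       // bitwise complement

public static class MaskAnalysisSketch
{
    // Conservative check: returns true only when every element of the result
    // must be either all zeros or all ones (assuming a single element type).
    public static bool IsZeroOrAllBitsPerElement(Expr e) => e switch
    {
        Compare => true,
        Bitwise b => IsZeroOrAllBitsPerElement(b.Left) && IsZeroOrAllBitsPerElement(b.Right),
        Not n => IsZeroOrAllBitsPerElement(n.Operand),
        _ => false,
    };

    public static void Main()
    {
        Expr data = new Load("data");
        Expr cmp1 = new Compare(data, new Load("needle1"));
        Expr cmp2 = new Compare(data, new Load("needle2"));

        Console.WriteLine(IsZeroOrAllBitsPerElement(new Bitwise('|', cmp1, cmp2))); // True
        Console.WriteLine(IsZeroOrAllBitsPerElement(new Bitwise('&', cmp1, data))); // False
    }
}
```

In this model, `cmp1 | cmp2` from the PR's example qualifies for the fast path, while anything that mixes in raw data falls back to the generic expansion.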

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/coreclr/jit/rationalize.h | Declares new ARM64 Rationalizer helpers for the rewrite/expansion. |
| src/coreclr/jit/rationalize.cpp | Implements the ARM64 rewrite for IndexOfWhereAllBitsSet and refactors EMSB expansion into a shared helper. |
| src/coreclr/jit/hwintrinsiclistarm64.h | Registers Vector128.IndexOfWhereAllBitsSet as an ARM64 helper intrinsic (special import, no direct codegen). |
| src/coreclr/jit/hwintrinsicarm64.cpp | Adds ARM64 special import for NI_Vector128_IndexOfWhereAllBitsSet (integral only; float/double fall back to managed). |
| src/coreclr/jit/gentree.h | Introduces GTF_HW_INPUT_ZERO_OR_ALLBITS for HWIntrinsic nodes. |
| src/coreclr/jit/compiler.h | Adds optVNBasedFoldExpr_HWIntrinsic declaration (guarded by HW intrinsics). |
| src/coreclr/jit/assertionprop.cpp | Adds VN-based folding to set GTF_HW_INPUT_ZERO_OR_ALLBITS when the input tree is provably 0/AllBitsSet. |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

@EgorBo EgorBo changed the title from "Optimize Vector128.IndexOfWhereAllBitsSet for arm64" to "Optimize Vector128.IndexOfWhereAllBitsSet and LastIndexOfWhereAllBitsSet for arm64" Apr 11, 2026
Copilot AI review requested due to automatic review settings April 11, 2026 17:15
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

@EgorBo
Member Author

EgorBo commented Apr 11, 2026

@EgorBot -arm -azure_arm -aws_arm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
    }

    byte[] data = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14 ];

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(ref MemoryMarshal.GetReference(data), 13, 14);

    [Benchmark]
    public int LastIndexOfBench() => LastIndexOf(ref MemoryMarshal.GetReference(data), 13);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(ref byte haystack, byte needle1, byte needle2)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(ref byte haystack, byte needle1)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.LastIndexOfWhereAllBitsSet(cmp1);
    }
}

@EgorBo EgorBo changed the title from "Optimize Vector128.IndexOfWhereAllBitsSet and LastIndexOfWhereAllBitsSet for arm64" to "ARM64: Optimize ExtractMostSignificantBits when input is 0/AllBitsSet per element" Apr 11, 2026
@EgorBo EgorBo force-pushed the optimize-IndexOfWhereAllBitsSet branch from 2d6828a to 3e7ca22 April 11, 2026 18:34
Copilot AI review requested due to automatic review settings April 11, 2026 18:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings April 11, 2026 21:24
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

@EgorBo
Member Author

EgorBo commented Apr 11, 2026

@EgorBot -arm -aws_arm -azure_arm

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    byte[] data1   = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14];
    ushort[] data2 = [0,0,0,11,0,0,0,12];
    byte[] data3 = new byte[1024];
    byte needle1 = 11;

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(data2, 13, 14);

    [Benchmark]
    public int IndexOfLoopBench()
    {
        Span<byte> span = data3;
        var val = Vector128.Create((byte)111);
        for (int i = 0; i < span.Length; i += 16)
        {
            // Load the 16-byte block that starts at offset i.
            Vector128<byte> data = Vector128.Create(span.Slice(i));
            Vector128<byte> cmp = Vector128.Equals(data, val);
            int idx = Vector128.IndexOfWhereAllBitsSet(cmp);
            if (idx != -1)
            {
                return idx + i;
            }
        }

        return -1;
    }

    [Benchmark]
    public int LastIndexOfBench_Found() => LastIndexOf(data1, needle1);

    [Benchmark]
    public int LastIndexOfBench_NotFound() => LastIndexOf(data1, 42);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(Span<ushort> span, ushort needle1, ushort needle2)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(Span<byte> span, byte needle1)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.LastIndexOfWhereAllBitsSet(cmp1);
    }
}

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

@EgorBo EgorBo marked this pull request as ready for review April 12, 2026 14:29
Copilot AI review requested due to automatic review settings April 12, 2026 14:29
@EgorBo EgorBo changed the title from "ARM64: Optimize ExtractMostSignificantBits when input is 0/AllBitsSet per element" to "ARM64: Optimize IndexOfWhereAllBitsSet when input is 0/AllBitsSet per element" Apr 12, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

@EgorBo EgorBo force-pushed the optimize-IndexOfWhereAllBitsSet branch from e6d4f07 to df819a9 April 12, 2026 15:16
@EgorBo EgorBo requested a review from Copilot April 12, 2026 15:36
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
