ARM64: Optimize IndexOfWhereAllBitsSet when input is 0/AllBitsSet per element by EgorBo · Pull Request #126790 · dotnet/runtime

EgorBo · 2026-04-11T15:10:21Z

In many places within the BCL, we need the exact index of the first match produced by a vector comparison. This is relatively cheap on x64 thanks to the movemask instruction (pmovmskb), which packs the most significant bit of each byte into a general-purpose register. ARM64 has no such instruction, and fully emulating movemask there is slow (see the old sequence below), so a different approach is needed.

Example:

static int IndexOfAny(ref byte haystack, byte needle1, byte needle2)
{
    var data = Vector128.LoadUnsafe(ref haystack);
    var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
    var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
    return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
}
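For reference, the example returns the index of the first byte in the 16-byte block that equals either needle, or -1 if there is none. Each byte lane of a Vector128.Equals result is 0x00 or 0xFF, and OR-ing two such masks keeps that property. A minimal scalar sketch of those semantics, useful for checking the vectorized result (the class name is hypothetical):

```csharp
using System;

static class ScalarReference
{
    // Hypothetical scalar equivalent of the IndexOfAny example:
    // index of the first byte equal to needle1 or needle2, else -1.
    public static int IndexOfAny(ReadOnlySpan<byte> haystack, byte needle1, byte needle2)
    {
        // Vector128.LoadUnsafe reads exactly 16 bytes, so only inspect those.
        for (int i = 0; i < 16; i++)
        {
            if (haystack[i] == needle1 || haystack[i] == needle2)
            {
                return i;
            }
        }
        return -1;
    }
}
```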

Current codegen:

            ldr     q16, [x0]
            uxtb    w0, w1
            dup     v17.16b, w0
            cmeq    v17.16b, v16.16b, v17.16b
            uxtb    w0, w2
            dup     v18.16b, w0
            cmeq    v16.16b, v16.16b, v18.16b
            orr     v16.16b, v17.16b, v16.16b

            ;; old logic:
            mvni    v17.4s, #0
            cmeq    v16.16b, v16.16b, v17.16b
            movi    v17.16b, #0x80
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD00]
            ushl    v16.16b, v16.16b, v17.16b
            uxtl2   v17.8h, v16.16b
            shl     v17.8h, v17.8h, #8
            uaddw   v16.8h, v17.8h, v16.8b
            addv    h16, v16.8h
            umov    w0, v16.h[0]
            rbit    w0, w0
            clz     w0, w0

            movn    w1, #0
            cmp     w0, #32
            csel    w0, w0, w1, ne
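The old sequence is a full movemask emulation. cmeq against AllBitsSet canonicalizes each lane, the AND with 0x80 keeps one bit per lane, and ushl applies a per-lane shift table. The contents of @RWD00 are not shown; presumably they move lane i's bit to position i % 8. uxtl2/shl/uaddw/addv then fold the 16 lanes into a 16-bit mask, and rbit + clz compute its trailing-zero count, which is 32 for an empty mask. A scalar sketch of the net effect (names are hypothetical):

```csharp
using System;
using System.Numerics;

static class OldSequenceSketch
{
    // Scalar model of the old sequence; lanes must hold at least 16 bytes.
    public static int IndexOfWhereAllBitsSet(ReadOnlySpan<byte> lanes)
    {
        uint mask = 0;
        for (int i = 0; i < 16; i++)
        {
            // cmeq with AllBitsSet + AND 0x80: only lanes equal to 0xFF contribute.
            if (lanes[i] == 0xFF)
            {
                // ushl + widen + addv together place lane i's bit at position i.
                mask |= 1u << i;
            }
        }
        // rbit + clz on a 32-bit register is a trailing-zero count (32 when mask == 0).
        int tz = BitOperations.TrailingZeroCount(mask);
        // cmp #32 / csel: no match -> -1.
        return tz == 32 ? -1 : tz;
    }
}
```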

New codegen:

            ldr     q16, [x0]
            uxtb    w0, w1
            dup     v17.16b, w0
            cmeq    v17.16b, v16.16b, v17.16b
            uxtb    w0, w2
            dup     v18.16b, w0
            cmeq    v16.16b, v16.16b, v18.16b
            orr     v16.16b, v17.16b, v16.16b

            ;; new logic:
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x1, x0
            clz     x1, x1
            lsr     w1, w1, #2

            movn    w2, #0
            cmp     x0, #0
            csel    w0, w1, w2, ne
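The new sequence replaces the mask construction with a single SHRN. Viewing the vector as eight 16-bit lanes, shifting each right by 4 and narrowing to 8 bits keeps the high nibble of every even byte and the low nibble of every odd byte, so byte lane i becomes nibble i of a 64-bit value. For 0/AllBitsSet input every nibble is 0x0 or 0xF, so the trailing-zero count divided by 4 (lsr #2) is the lane index, and a zero value means no match. A scalar sketch (class and method names are hypothetical):

```csharp
using System;
using System.Buffers.Binary;
using System.Numerics;

static class ShrnSketch
{
    // Scalar model of "shrn v.8b, v.8h, #4": treat 16 byte lanes as eight
    // little-endian 16-bit lanes, shift each right by 4, keep the low byte.
    public static ulong NibbleMask(ReadOnlySpan<byte> lanes)
    {
        ulong result = 0;
        for (int j = 0; j < 8; j++)
        {
            ushort pair = BinaryPrimitives.ReadUInt16LittleEndian(lanes.Slice(2 * j, 2));
            byte narrowed = (byte)(pair >> 4);
            result |= (ulong)narrowed << (8 * j);
        }
        return result;
    }

    // Valid only when every lane is 0x00 or 0xFF.
    public static int IndexOfWhereAllBitsSet(ReadOnlySpan<byte> lanes)
    {
        ulong packed = NibbleMask(lanes);   // umov x0, v16.d[0]
        if (packed == 0) return -1;         // cmp x0, #0 / csel ..., ne
        // rbit + clz is a trailing-zero count; lsr #2 divides by 4 bits per lane.
        return BitOperations.TrailingZeroCount(packed) >> 2;
    }
}
```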

dotnet-policy-service · 2026-04-11T15:11:27Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR improves CoreCLR JIT ARM64 codegen for Vector128.IndexOfWhereAllBitsSet, targeting the common pattern where compare results are vectors of 0 / AllBitsSet and the goal is to quickly find the first matching lane index.

Changes:

Add ARM64-specific import support for Vector128.IndexOfWhereAllBitsSet and a Rationalizer rewrite that emits a faster SHRN+CTZ-style sequence when the input is proven to be 0/AllBitsSet.
Factor ARM64 ExtractMostSignificantBits lowering into a reusable helper (ExpandExtractMostSignificantBitsArm64) so it can be shared by both ExtractMostSignificantBits and the generic fallback for IndexOfWhereAllBitsSet.
Add a VN-based fold hook to mark IndexOfWhereAllBitsSet inputs as 0/AllBitsSet (via a new GTF_HW_INPUT_ZERO_OR_ALLBITS flag), enabling the optimized path.
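The rewrite is gated on the input being 0/AllBitsSet per element because SHRN #4 keeps only four bits of each byte lane. A lane that is neither 0x00 nor 0xFF can then be misreported. A minimal sketch (names hypothetical) that applies the nibble shortcut without the precondition and compares it with the reference semantics:

```csharp
using System;
using System.Numerics;

static class PreconditionSketch
{
    // Reference semantics: lane i matches only if all 8 bits are set.
    public static int Reference(ReadOnlySpan<byte> lanes)
    {
        for (int i = 0; i < 16; i++)
        {
            if (lanes[i] == 0xFF) return i;
        }
        return -1;
    }

    // SHRN #4 shortcut applied without the precondition: only one nibble of
    // each lane survives (high nibble for even lanes, low nibble for odd lanes).
    public static int NibbleShortcut(ReadOnlySpan<byte> lanes)
    {
        ulong packed = 0;
        for (int i = 0; i < 16; i++)
        {
            int nibble = (i % 2 == 0) ? lanes[i] >> 4 : lanes[i] & 0xF;
            packed |= (ulong)nibble << (4 * i);
        }
        return packed == 0 ? -1 : BitOperations.TrailingZeroCount(packed) >> 2;
    }
}
```

With canonical masks both functions agree. A lane such as 0xF0 at index 2, however, makes the shortcut report 2 while the reference reports the real match at index 5. This is why inputs that are not proven to be 0/AllBitsSet take the generic fallback instead.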

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


File	Description
src/coreclr/jit/rationalize.h	Declares new ARM64 Rationalizer helpers for the rewrite/expansion.
src/coreclr/jit/rationalize.cpp	Implements the ARM64 rewrite for `IndexOfWhereAllBitsSet` and refactors the ExtractMostSignificantBits (EMSB) expansion into a shared helper.
src/coreclr/jit/hwintrinsiclistarm64.h	Registers `Vector128.IndexOfWhereAllBitsSet` as an ARM64 helper intrinsic (special import, no direct codegen).
src/coreclr/jit/hwintrinsicarm64.cpp	Adds ARM64 special import for `NI_Vector128_IndexOfWhereAllBitsSet` (integral only; float/double fall back to managed).
src/coreclr/jit/gentree.h	Introduces `GTF_HW_INPUT_ZERO_OR_ALLBITS` for HWIntrinsic nodes.
src/coreclr/jit/compiler.h	Adds `optVNBasedFoldExpr_HWIntrinsic` declaration (guarded by HW intrinsics).
src/coreclr/jit/assertionprop.cpp	Adds VN-based folding to set `GTF_HW_INPUT_ZERO_OR_ALLBITS` when the input tree is provably `0`/`AllBitsSet`.
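The actual analysis in assertionprop.cpp works on value numbers, and its exact rules are not shown here. As a hypothetical illustration of what "provably 0/AllBitsSet" can mean: compare results qualify, constants qualify when every lane is 0x00 or 0xFF, and bitwise AND/OR/XOR/NOT of qualifying values still qualify lane by lane. Arithmetic or arbitrary loads do not. The toy IR and names below are assumptions, not JIT types:

```csharp
using System;

// Hypothetical toy IR; the JIT's real GenTree/VN representation differs.
abstract record Node;
sealed record Compare(string Kind) : Node;                    // e.g. Vector128.Equals
sealed record Constant(byte Lane) : Node;                     // splatted constant
sealed record Binary(char Op, Node Left, Node Right) : Node;  // '&', '|', '^', '+'
sealed record Not(Node Operand) : Node;
sealed record Load() : Node;                                  // arbitrary memory contents

static class MaskAnalysis
{
    // True when every lane of the node's value is provably 0x00 or 0xFF.
    public static bool IsZeroOrAllBitsSet(Node node) => node switch
    {
        Compare => true,
        Constant c => c.Lane is 0x00 or 0xFF,
        // Bitwise AND/OR/XOR of two masks is still a mask, lane by lane.
        Binary { Op: '&' or '|' or '^' } b => IsZeroOrAllBitsSet(b.Left) && IsZeroOrAllBitsSet(b.Right),
        // NOT maps 0x00 <-> 0xFF.
        Not n => IsZeroOrAllBitsSet(n.Operand),
        // Arithmetic, loads, anything else: nothing can be proven.
        _ => false,
    };
}
```

Under these rules the `cmp1 | cmp2` input from the example qualifies, while `cmp1 + cmp2` does not.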

src/coreclr/jit/assertionprop.cpp

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

src/coreclr/jit/compiler.h

src/coreclr/jit/assertionprop.cpp

src/coreclr/jit/rationalize.cpp

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

src/coreclr/jit/rationalize.cpp

EgorBo · 2026-04-11T17:22:42Z

@EgorBot -arm -azure_arm -aws_arm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
    }

    byte[] data = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14];

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(ref MemoryMarshal.GetReference(data), 13, 14);

    [Benchmark]
    public int LastIndexOfBench() => LastIndexOf(ref MemoryMarshal.GetReference(data), 13);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(ref byte haystack, byte needle1, byte needle2)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(ref byte haystack, byte needle1)
    {
        var data = Vector128.LoadUnsafe(ref haystack);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.LastIndexOfWhereAllBitsSet(cmp1);
    }
}
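For comparison, the same SHRN trick can be hand-written with the AdvSimd intrinsics, which is roughly what the JIT now emits for the benchmark's IndexOfAny. A sketch under the assumption that the input mask is 0/AllBitsSet per lane (the method name is hypothetical). It falls back to ExtractMostSignificantBits where AdvSimd is unavailable:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ManualShrn
{
    // Precondition: every lane of mask is 0x00 or 0xFF (e.g. a Vector128.Equals result).
    public static int IndexOfFirstMatch(Vector128<byte> mask)
    {
        if (AdvSimd.IsSupported)
        {
            // shrn v.8b, v.8h, #4: one nibble per byte lane, packed into 64 bits.
            ulong packed = AdvSimd.ShiftRightLogicalNarrowingLower(mask.AsUInt16(), 4)
                                  .AsUInt64()
                                  .ToScalar();
            return packed == 0 ? -1 : BitOperations.TrailingZeroCount(packed) >> 2;
        }

        // Elsewhere (e.g. x64) movemask is cheap: one bit per byte lane.
        uint bits = mask.ExtractMostSignificantBits();
        return bits == 0 ? -1 : BitOperations.TrailingZeroCount(bits);
    }
}
```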

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

src/coreclr/jit/rationalize.cpp

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs

src/coreclr/jit/rationalize.cpp

src/coreclr/jit/assertionprop.cpp

EgorBo · 2026-04-11T22:46:03Z

@EgorBot -arm -aws_arm -azure_arm

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Benchmarks
{
    byte[] data1   = [0,0,0,11,0,0,0,12,0,0,0,13,0,0,0,14];
    ushort[] data2 = [0,0,0,11,0,0,0,12];
    byte[] data3 = new byte[1024];
    byte needle1 = 11;

    [Benchmark]
    public int IndexOfAnyBench() => IndexOfAny(data2, 13, 14);

    [Benchmark]
    public int IndexOfLoopBench()
    {
        Span<byte> span = data3;
        var val = Vector128.Create((byte)111);
        for (int i = 0; i < span.Length; i += 16)
        {
            // Load the 16 bytes starting at offset i.
            Vector128<byte> data = Vector128.Create(span.Slice(i));
            Vector128<byte> cmp = Vector128.Equals(data, val);
            int idx = Vector128.IndexOfWhereAllBitsSet(cmp);
            if (idx != -1)
            {
                return idx + i;
            }
        }

        return -1;
    }

    [Benchmark]
    public int LastIndexOfBench_Found() => LastIndexOf(data1, needle1);

    [Benchmark]
    public int LastIndexOfBench_NotFound() => LastIndexOf(data1, 42);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int IndexOfAny(Span<ushort> span, ushort needle1, ushort needle2)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        var cmp2 = Vector128.Equals(data, Vector128.Create(needle2));
        return Vector128.IndexOfWhereAllBitsSet(cmp1 | cmp2);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int LastIndexOf(Span<byte> span, byte needle1)
    {
        var data = Vector128.Create(span);
        var cmp1 = Vector128.Equals(data, Vector128.Create(needle1));
        return Vector128.LastIndexOfWhereAllBitsSet(cmp1);
    }
}
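This benchmark also exercises ushort elements and LastIndexOfWhereAllBitsSet, whose codegen is not shown above. Assume the same SHRN #4 packing. Then each element of sizeof(T) bytes occupies 4 * sizeof(T) bits of the 64-bit value, so the first index is the trailing-zero count divided by that width, and the last index comes from the leading-zero count. A scalar sketch (names hypothetical):

```csharp
using System;
using System.Numerics;

static class NibbleIndexSketch
{
    // packed: 64-bit SHRN #4 result of a 0/AllBitsSet vector (4 bits per byte).
    // elementSize: sizeof(T) in bytes (1, 2, 4 or 8).
    public static int FirstIndex(ulong packed, int elementSize)
    {
        if (packed == 0) return -1;
        return BitOperations.TrailingZeroCount(packed) / (4 * elementSize);
    }

    public static int LastIndex(ulong packed, int elementSize)
    {
        if (packed == 0) return -1;
        // Position of the highest set bit, then map it back to an element.
        int highestBit = 63 - BitOperations.LeadingZeroCount(packed);
        return highestBit / (4 * elementSize);
    }
}
```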

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

src/coreclr/jit/rationalize.cpp

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

src/coreclr/jit/assertionprop.cpp

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

src/coreclr/jit/assertionprop.cpp

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs

Copilot AI review requested due to automatic review settings April 11, 2026 15:10

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2026

dotnet-policy-service bot assigned EgorBo Apr 11, 2026

Copilot started reviewing on behalf of EgorBo April 11, 2026 15:11

EgorBo mentioned this pull request Apr 11, 2026

Optimize IndexOfAnyAsciiSearcher on Arm64 #126678

Merged

Copilot AI reviewed Apr 11, 2026

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

Copilot AI review requested due to automatic review settings April 11, 2026 16:50

MihuBot mentioned this pull request Apr 11, 2026

[JitDiff ARM64] [EgorBo] Optimize Vector128.IndexOfWhereAllBitsSet for arm64 MihuBot/runtime-utils#1846

Open

Copilot started reviewing on behalf of EgorBo April 11, 2026 16:51

Copilot AI reviewed Apr 11, 2026

src/coreclr/jit/compiler.h (outdated, resolved)

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

src/coreclr/jit/rationalize.cpp (resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

EgorBo changed the title ~~Optimize Vector128.IndexOfWhereAllBitsSet for arm64~~ Optimize Vector128.IndexOfWhereAllBitsSet and LastIndexOfWhereAllBitsSet for arm64 Apr 11, 2026

Copilot AI review requested due to automatic review settings April 11, 2026 17:15

Copilot started reviewing on behalf of EgorBo April 11, 2026 17:17

Copilot AI reviewed Apr 11, 2026

src/coreclr/jit/rationalize.cpp (outdated, resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

EgorBot mentioned this pull request Apr 11, 2026

Benchmarks for dotnet/runtime#126790 (for @EgorBo) EgorBot/Benchmarks#115

Open

EgorBo changed the title ~~Optimize Vector128.IndexOfWhereAllBitsSet and LastIndexOfWhereAllBitsSet for arm64~~ ARM64: Optimize ExtractMostSignificantBits when input is 0/AllBitsSet per element Apr 11, 2026

EgorBo force-pushed the optimize-IndexOfWhereAllBitsSet branch from 2d6828a to 3e7ca22 April 11, 2026 18:34

Copilot AI review requested due to automatic review settings April 11, 2026 18:47

Copilot started reviewing on behalf of EgorBo April 11, 2026 18:48

Copilot AI reviewed Apr 11, 2026

src/coreclr/jit/rationalize.cpp (resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs (resolved)

Copilot AI review requested due to automatic review settings April 11, 2026 21:24

Copilot started reviewing on behalf of EgorBo April 11, 2026 21:25

Copilot AI reviewed Apr 11, 2026

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs (resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

EgorBot mentioned this pull request Apr 11, 2026

Benchmarks for dotnet/runtime#126790 (for @EgorBo) EgorBot/Benchmarks#116

Open

Copilot AI review requested due to automatic review settings April 12, 2026 00:51

Copilot started reviewing on behalf of EgorBo April 12, 2026 00:52

Copilot AI reviewed Apr 12, 2026

src/coreclr/jit/rationalize.cpp (outdated, resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

src/coreclr/jit/rationalize.cpp (outdated, resolved)

build-analysis bot mentioned this pull request Apr 12, 2026

System.Net.NameResolution.Tests DNS failures: Name or service not known #126641

Open

EgorBo marked this pull request as ready for review April 12, 2026 14:29

Copilot AI review requested due to automatic review settings April 12, 2026 14:29

EgorBo changed the title ~~ARM64: Optimize ExtractMostSignificantBits when input is 0/AllBitsSet per element~~ ARM64: Optimize IndexOfWhereAllBitsSet when input is 0/AllBitsSet per element Apr 12, 2026

Copilot started reviewing on behalf of EgorBo April 12, 2026 14:30

Copilot AI reviewed Apr 12, 2026

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

EgorBo added 2 commits April 12, 2026 16:54

test

ff61bcb

optimize NI_Vector128_IndexOfWhereAllBitsSet

df819a9

EgorBo force-pushed the optimize-IndexOfWhereAllBitsSet branch from e6d4f07 to df819a9 April 12, 2026 15:16

EgorBo requested a review from Copilot April 12, 2026 15:36

Copilot started reviewing on behalf of EgorBo April 12, 2026 15:37

Copilot AI reviewed Apr 12, 2026

src/coreclr/jit/assertionprop.cpp (outdated, resolved)

src/tests/JIT/HardwareIntrinsics/General/HwiOp/IndexOfWhereAllBitsSet.cs (resolved)

EgorBo added 2 commits April 12, 2026 18:05

fb

6fbcba7

Merge branch 'main' into optimize-IndexOfWhereAllBitsSet

fdb32d3
