JIT: Improve codegen for Vector128/256.NarrowWithSaturation #126226
saucecontrol wants to merge 6 commits into dotnet:main from
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR refactors x86/x64 SIMD vector conversion intrinsic selection into a shared helper and adds missing fast paths for Vector128/256.NarrowWithSaturation in non-AVX512 environments, reducing instruction count and code size for several narrow-with-saturation cases.
Changes:
- Introduce `GenTreeHWIntrinsic::GetHWIntrinsicIdForVectorConvert(...)` to centralize lookup of conversion-related intrinsics (including an optional saturating preference).
- Improve `Vector128/256.NarrowWithSaturation` codegen on pre-AVX512 machines by using pack-based sequences where applicable.
- Refactor existing conversion/widen/narrow construction to use the shared lookup helper instead of duplicated switch logic.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/coreclr/jit/hwintrinsicxarch.cpp | Uses the new conversion lookup helper and adds optimized pack-based paths for NarrowWithSaturation on non-AVX512. |
| src/coreclr/jit/gentree.h | Declares the new shared vector-convert intrinsic lookup helper. |
| src/coreclr/jit/gentree.cpp | Implements the helper and refactors several SIMD convert/narrow/widen paths to use it. |
cc @dotnet/jit-contrib this is ready for review.
```cpp
var_types sourceType,
var_types targetType,
unsigned  simdSize,
bool      preferSaturating = false,
```
nit: I don't think `prefer` is a good name, and it's likely to cause confusion/bugs in the future.
We have two types of narrowing, truncation and saturation, where each is explicitly required for one of the publicly exposed APIs and so we need a way to emit each deterministically.
We then have an optimization where we want whatever is fastest under the presumption that everything is in range, which is functionally a NarrowEstimate API (going based on the naming we use elsewhere) and which is where the knowledge of "what is fastest" can be handled (i.e. just use the truncating APIs on AVX-512 hardware and the saturating ones downlevel).
I think we want to split this into those three, accordingly, and if anything this helper should simply tell you whether it gave back a saturating or truncating instruction; there shouldn't be any "prefer" mechanic because such a preference is going to be driven based on some need or optimization instead.
So... where I was going with this is that in quite a few places, we would prefer an instruction that saturates, but if there's an instruction that does the conversion without saturating, we can still use it and just do the saturation before or after. Or we may choose not to saturate. i.e. it really is a preference, with a choice to be made after we see what the lookup found.
What I was originally looking at was the floating->integral vector conversions, where we have e.g. ConvertToInt64Native and ConvertToInt64, with the latter guaranteeing a saturating conversion and the former making that undefined. The behaviors of those are not necessarily expected to be the same, but they're more inconsistent today than they need to be.
- On Arm64, `ConvertToInt64` and `ConvertToInt64Native` do the same thing, because the native instructions always saturate.
- On x86/x64 with AVX-512 plus, we saturate `ConvertToInt64` but not `ConvertToInt64Native` (even though saturation is free on AVX10v2).
- On x86/x64 without AVX-512, we saturate both `ConvertToInt64` and `ConvertToInt64Native`, because we fall back to a scalar cast loop for both, and the scalar cast saturates. But even this is inconsistent, because we don't saturate `ConvertToInt32Native`, since that has a legacy native instruction and is handled as intrinsic.
What I envisioned was that we would 'prefer' a saturating conversion for both, which would change the behavior of Avx10v2 to match what we do on Arm64, because honestly, the saturation is just the most sensible behavior. I would then add emulation for the pre-AVX-512 paths (which, again, today fall back to unrolled scalar implementations), which would also saturate, because it's really no more expensive to saturate than it is to emulate exactly what the native instructions do. That emulation would be used for both ConvertToInt64 and ConvertToInt64Native.
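As a scalar reference for the saturating behavior described above, a sketch of what the emulated double-to-int64 conversion could look like (the function name is hypothetical, not an actual JIT helper; NaN-maps-to-zero and the clamp bounds are the saturating semantics the comment describes):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar model of a saturating double -> int64 conversion:
// NaN produces 0, and out-of-range inputs clamp to the int64 limits.
static int64_t SaturatingConvertToInt64(double value)
{
    if (std::isnan(value))
    {
        return 0;
    }
    if (value >= 9223372036854775808.0) // 2^63: too large for int64
    {
        return std::numeric_limits<int64_t>::max();
    }
    if (value <= -9223372036854775808.0) // -2^63: int64 minimum
    {
        return std::numeric_limits<int64_t>::min();
    }
    // In range: plain truncating cast is well-defined.
    return static_cast<int64_t>(value);
}
```

The point of the proposal is that this one emulation could back both `ConvertToInt64` and `ConvertToInt64Native` on pre-AVX-512 paths.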
This pattern also fits for NarrowSaturating, where we would 'prefer' a single instruction that does it directly, but would accept a non-saturating instruction since we can emulate that before narrowing.
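The "emulate saturation before narrowing" idea can be illustrated per-lane; this is a scalar sketch (the name is illustrative, and a real vector version would use min/max instructions before a truncating pack):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative per-lane model: clamp to the destination range first,
// then a plain truncating narrow is exact.
static int16_t NarrowWithSaturationScalar(int32_t value)
{
    int32_t clamped = std::min(std::max(value, (int32_t)INT16_MIN),
                               (int32_t)INT16_MAX);
    return static_cast<int16_t>(clamped);
}
```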
Anyway, if you don't like that pattern, I'm fine to change it. I can still do what I wanted to do -- it would just be two instruction lookups in those places instead of one with a flag.
I should also say that under the model I envisioned, gtNewSimdCvtNode and gtNewSimdCvtNativeNode would roll up into the same thing, where the Native implementation is expressed only as a preference for a saturating instruction but would accept non-saturating if that's all we have available.
I think the issue is that's a view fairly centered on the floating-point conversions. In the inverse direction you have the normal integer-to-integer narrow operations, where users typically expect truncation instead, outside of specific scenarios. It's also where Arm64 provides the truncating versions out of the box and where x64 provides faster truncating versions on modern hardware, so in that case you don't want to prefer saturation.
So I think it really comes down to "you require a behavior" or "you don't", and in the latter case you typically want whatever is fastest (and typically that's because you have some known invariant, which the JIT can't know, that the value is in range and so edge cases aren't relevant).
So I think the simplest thing to do here is to do what we do elsewhere, which is define Narrow (truncating), NarrowSaturating, and NarrowEstimate helpers. The first two do exactly what they're described to do, whatever way makes it correct. The latter then dispatches to one of the two based on the types provided and knowing whichever is faster for the underlying hardware (i.e. it may truncate, it may saturate, it may even change based on what ISAs are enabled/disabled). -- and in the case both are equally fast, then it prefers whatever is consistent with the non-estimate behavior (so truncation for integer->integer and saturation for float->integer, with integer->float only having one correct behavior which is saturating)
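The dispatch rule described above can be sketched as a small selection function (the enum and the `avx512Supported` flag are hypothetical stand-ins for illustration, not actual JIT types):

```cpp
#include <cassert>

// Illustrative sketch of the proposed NarrowEstimate dispatch: pick
// whichever deterministic flavor is fastest on the target, falling back
// to the non-estimate behavior when both are equally fast.
enum class NarrowKind { Truncating, Saturating };

static NarrowKind SelectNarrowEstimate(bool isFloatToInt, bool avx512Supported)
{
    if (isFloatToInt)
    {
        // float -> integer: saturation is the non-estimate behavior,
        // so prefer it when speed is a tie.
        return NarrowKind::Saturating;
    }
    // integer -> integer: AVX-512 has fast truncating narrows; downlevel,
    // the legacy pack instructions saturate, so that's the cheap path.
    return avx512Supported ? NarrowKind::Truncating : NarrowKind::Saturating;
}
```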
Ok, I like the way that sounds. Will see what it looks like in code.
Resolves #116526
This adds some missing optimized paths for `NarrowWithSaturation` intrinsics in pre-AVX-512 environments.

`Vector128.NarrowWithSaturation` was fully accelerated for signed types but not unsigned:

```diff
  vbroadcastss xmm0, dword ptr [reloc @RWD00]
  vpminuw      xmm1, xmm0, xmmword ptr [rdx]
- vpand        xmm1, xmm1, xmm0
- vpminuw      xmm2, xmm0, xmmword ptr [r8]
- vpand        xmm0, xmm2, xmm0
+ vpminuw      xmm0, xmm0, xmmword ptr [r8]
  vpackuswb    xmm0, xmm1, xmm0
  vmovups      xmmword ptr [rcx], xmm0
  mov          rax, rcx
  ret

RWD00 dd 00FF00FFh ; 2.34184e-38

-; Total bytes of code 39
+; Total bytes of code 31
```

`Vector256.NarrowWithSaturation` was using the slow path for both signed and unsigned:

The bulk of the code changes here are from a refactoring of the intrinsic lookup for various vector convert ops. I'm filling in some of the optimization gaps in these intrinsics, and the tangle of logic required to select the right intrinsic is spread out in various places. Having it in a shared method will make it easier to complete the other changes I have planned.
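The pack-based sequence in the `Vector128` diff can be modeled per-lane in scalar code; this sketch (the function name is illustrative) shows why the `vpminuw` pre-clamp makes `vpackuswb` safe, given that `packuswb` applies signed int16-to-uint8 saturation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-lane model of the pack-based narrow: clamp each uint16 lane to 255
// (vpminuw against a 0x00FF constant), then apply packuswb's signed
// int16 -> uint8 saturation. Without the clamp, lanes >= 0x8000 would be
// treated as negative by packuswb and saturate to 0 instead of 255.
static uint8_t PackedNarrowU16Lane(uint16_t lane)
{
    uint16_t clamped  = std::min<uint16_t>(lane, 0x00FF); // vpminuw
    int16_t  asSigned = static_cast<int16_t>(clamped);    // packuswb input
    if (asSigned < 0)   return 0;                         // signed saturation
    if (asSigned > 255) return 255;
    return static_cast<uint8_t>(asSigned);
}
```

After the clamp, every lane is already in `[0, 255]`, so the signed saturation inside `packuswb` never fires incorrectly and the extra `vpand` masking in the old sequence becomes unnecessary.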
The refactor is split into the first commit, which is zero-diff. The second commit onward includes the codegen improvements.
Full diffs