Porting additional SIMD Intrinsics to use SimdAsHWIntrinsic#37882
Porting additional SIMD Intrinsics to use SimdAsHWIntrinsic#37882tannergooding merged 25 commits intodotnet:masterfrom
Conversation
4044add to
e98213d
Compare
e98213d to
defea13
Compare
defea13 to
4105ab7
Compare
8ac8ead to
63ec8f0
Compare
bfb7472 to
d080102
Compare
|
CC. @CarolEidt, @echesakovMSFT |
|
Benchmarks x64 Frameworks x64 |
|
Working on getting perf numbers but this should resolve the following perf regression #37425 |
|
Most of the diffs are essentially the following: - vmovaps xmm1, xmm0
- vdpps xmm1, xmm0, 113
- vmovaps xmm0, xmm1
+ vdpps xmm0, xmm0, xmm0, 113There are also a few similar too: vmovmskps xrax, xmm0
- mov edx, 7
- and eax, edx
+ and eax, 7
Likewise a few that go from:
```diff
- mov eax, 0xD1FFAB1E
- vmovd xmm0, eax
- vpbroadcastd ymm0, ymm0
+ vmovupd ymm0, ymmword ptr[reloc @RWD32]A more extreme example: vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+48H], xmm0
vmovdqu xmmword ptr [rsp+58H], xmm0
lea rcx, bword ptr [rsp+48H]
call Vector`1:.ctor(Vector`1):this
mov rcx, rdi
vmovdqu xmm0, xmmword ptr [rsp+48H]
vmovdqu xmmword ptr [rsp+28H], xmm0
vmovdqu xmm0, xmmword ptr [rsp+58H]
vmovdqu xmmword ptr [rsp+38H], xmm0
lea rdx, bword ptr [rsp+28H]
mov r8, rsi
call Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
mov rax, rdiTo: vmovupd ymm0, ymmword ptr[rdx]
vbroadcastss ymm0, ymm0
vmovupd ymmword ptr[rsp+50H], ymm0
mov rcx, rsi
vmovupd ymm0, ymmword ptr[rsp+50H]
vmovupd ymmword ptr[rsp+20H], ymm0
lea rdx, bword ptr [rsp+20H]
call Vector`1:op_Multiply(Vector`1,Vector`1):Vector`1
mov rax, rsiThe |
There was a problem hiding this comment.
Fully removing SIMDIntrinsicInit and removing gtGetSIMDZero requires a bit more work. I logged #37043 as the more general issue.
There was a problem hiding this comment.
I think that, beyond this PR, any other improvements should likely hold off for .NET 6.
|
CC. @dotnet/jit-contrib This should be ready for review. |
CarolEidt
left a comment
There was a problem hiding this comment.
Mostly minor comment stuff.
There was a problem hiding this comment.
It seems like it would make more sense to just split out this case (as it was previously), even with a separate check for the long types.
There was a problem hiding this comment.
Maybe the confusing part here is that the AVX check isn't checking for some instruction support, it's just a "we support Vector256<T>" check.
The instruction emitted is the same for 128 or 256-bit, it's just that we support Vector256<T> if AVX is supported, so the code really is identical (we even emit the register access as a 128-bit access)
There was a problem hiding this comment.
My point is that (AFAICT) the previous calls to compExactlyDependsOn are unnecessary for the NI_Vector256_ToScalar case. So perhaps just checking for that first would make more sense.
There was a problem hiding this comment.
That's generally true except for the SSE2_X64 case, as we won't have AVX_X64 until #38460 goes in (although you also can't disable SSE2.X64, you can only disable SSE2 itself).
If duplicated, the compExactlyDependsOn(SSE2) checks would become compExactlyDependsOn(AVX) and the compExactlyDependsOn(SSE2_X64) check would become compExactlyDependsOn(AVX) && compExactlyDependsOn(SSE2_X64). Once #38460 is merged, it could be simplified to just compExactlyDependsOn(AVX_X64)
There was a problem hiding this comment.
So is it correct that only the SSE2_X64 check is non-redundant for the AVX case? It still seems confusing to me, and a lot of unnecessary checks for that case.
There was a problem hiding this comment.
Right, only the SSE2_X64 case is non redundant. I just pushed a fix that breaks it apart and which should make that distinction clearer.
…AllOnes, and SIMDIntrinsicGetZero
…UBLE and fixing VectorT128_get_One
Co-authored-by: Carol Eidt <carol.eidt@microsoft.com>
6db4ec3 to
3ac2b4a
Compare
CarolEidt
left a comment
There was a problem hiding this comment.
LGTM - thanks for the comments and code shufflings!
This ports additional intrinsics such as
SIMDIntrinsicInit,SIMDIntrinsicGetOne, andSIMDIntrinsicDot