-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512 #95568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsOverviewThis PR upgrades Vector128/256/512 Sum() implementations. The existing Sum() implementations(Vector128 and Vector256) use Vector128Case 1: Without PR With PR Case 2: Without PR With PR Case 3: Without PR With PR Case 4: Without PR With PR Case 5: Without PR With PR Case 6: Without PR With PR Vector256Case 1: Without PR With PR Case 2: Without PR With PR Case 3: Without PR With PR Case 4: Without PR With PR Case 5: Without PR With PR Case 6: Without PR With PR Vector512Case 1: Without PR With PR Case 2: Without PR With PR Case 3: Without PR With PR Case 4: Without PR With PR Case 5: Without PR With PR Case 6: Without PR With PR Instructions and dependenciesVector128 8 bits integer 16 bits integer 32 bits integer 64 bits integer float double Vector256 All integer types All float types Vector512 All integer types All float types
|
26e8e12 to
2408ec9
Compare
|
Does new implementation change the resulting value compared to the previous? Due to IEEE754 rules |
The resulting values should remain the same. I see some test failures. Will verify/fix those before making PR 'ReadyForReview' re Feel free to let me know if you see any issues here. |
|
@EgorBo The test failures confuse me. They are with AVX512F disabled. I looked at the disassembly and the disasm is the same with my change vs main. I'll look at this but pointing it out incase I'm missing anything obvious With AVX512F Enabled - This passes |
216f087 to
044aebe
Compare
src/coreclr/jit/hwintrinsicxarch.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What support is missing for long on 32-bit?
I would have expected this to generally "just work" and for us to be able to use _ToScalar provided SSE4.1 is supported given that GetElement has the required decomposition support for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a failure with decomposition. I tried again and passes locally. I enabled that code to get it to run on CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is the right way to handle this - 9e3657f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't quite work unfortunately and the actual handling is going to be a little bit more complex.
Noting that I'm fine with this being handled as a separate PR (and am happy to do that work, seeing as I'm pretty sure I know what will need to be done here), I had initially thought it might be slightly simpler but looks like its not quite all there..
The "proper" fix likely entails:
- Extracting this logic to a
gtNewSimdToScalarNodehelper: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L2917-L2937 - Updating various places that are manually doing
gtNewSimdHWIntrinsicNode(retType, op1, NI_Vector###_ToScalar, simdBaseJitType, simdSize)to call the introduced helper - Have the
impSpecialIntrinsichandling forNI_Vector###_Sumdo:
#if defined(TARGET_X86)
else if (varTypeIsLong(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_SSE41))
{
// We need SSE41 to handle long, use software fallback
break;
}
#endif // TARGET_X86At some future point we can ensure that NI_Vector###_GetElement has decomp handling for pre-SSE4.1 as well, it's just slightly more complex given that TYP_INT currently requires SSE4.1 as well. The fix to get that working is likely to just reuse the TYP_FLOAT handling for getting the relevant 32-bit part and then using ToScalar (shifting can then be used for the relevant 8-bit or 16-bit part for byte/short).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the implementation to use the helper. Waiting for CI tests to pass. My local tests passed
df322c6 to
6e77487
Compare
6e77487 to
9e3657f
Compare
d7a3c62 to
6b3eec7
Compare
6b3eec7 to
76e307e
Compare
| // | ||
| GenTree* Compiler::gtNewSimdToScalarNode(var_types type, GenTree* op1, CorInfoType simdBaseJitType, unsigned simdSize) | ||
| { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray empty line:
|
CC. @dotnet/jit-contrib, this needs a secondary review but should be good to merge otherwise |
…wering for Vector512 (dotnet#95568) * Updating Sum() implementation. * Fixing codegen * Addressing review comments. * Fix Formatting * Enabling for long on x86. * Cleaning up ToScalar implementation

Overview
This PR upgrades Vector128/256/512 Sum() implementations. The existing Sum() implementations(Vector128 and Vector256) use
haddand are not the most efficient. This commit modifies the existing implementations and adds the Vector512 sum(0 implementationsVector128
Case 1:
byte/ubytetypesWithout PR
Not lowered
With PR
Case 2:
Int16typesWithout PR
With PR
Case 3:
Int32typesWithout PR
With PR
Case 4:
longtypesWithout PR
Not lowered
With PR
Case 5:
floattypesWithout PR
With PR
Case 6:
doubletypesWithout PR
With PR
Vector256
Case 1:
byte/ubytetypesWithout PR
Not lowered
With PR
Case 2:
Int16typesWithout PR
With PR
Case 3:
Int32typesWithout PR
With PR
Case 4:
longtypesWithout PR
Not lowered
With PR
Case 5:
floattypesWithout PR
With PR
Case 6:
doubletypesWithout PR
With PR
Vector512
Case 1:
byte/ubytetypesWithout PR
Not lowered
With PR
Case 2:
Int16typesWithout PR
Not lowered
With PR
Case 3:
Int32typesWithout PR
Not lowered
With PR
Case 4:
longtypesWithout PR
Not lowered
With PR
Case 5:
floattypesWithout PR
Not lowered
With PR
Case 6:
doubletypesWithout PR
Not lowered
With PR
Instructions and dependencies
Vector128
8 bits integer
vpsrldq- sse2vpaddb- sse216 bits integer
vpsrldq- sse2vpaddw- sse232 bits integer
vpsrldq- sse2vpaddd- sse264 bits integer
vpsrldq- sse2vpaddq- sse2float
shufps- sseaddps- ssedouble
shufpd- sseaddpd- sse2Vector256
the corresponding Vector128 instr + the following
All integer types
vextracti128- AVX2All float types
vextractf128- AVXVector512
the corresponding Vector128 instr + the following
All integer types
vextracti64x4- AVX512FAll float types
vextractf64x4- AVX512FPerformance numbers
On ICX
On SPR