Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics #37139
Conversation
Reviewers: @echesakovMSFT, @tannergooding, @TamarChristinaArm, @BruceForstall
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Thanks @kunalspathak! These look great. Unrelated question to your change, but why does it allocate so much stack space?
There's also a weird gap in between the stores. Could it be thinking that internally all vectors are 16 bytes? If they were placed next to each other you could optimize these with …
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1)
/// <returns>A new <see cref="Vector128{Byte}" /> initialized from <paramref name="lower" /> and <paramref name="upper" />.</returns>
public static unsafe Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper)
These methods should be aggressively inlined.
It might also be good to just treat them as intrinsic and to create the appropriate nodes in importation or lowering so these don't impact inlining of other methods due to increased IL size or additional locals introduced.
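As a sketch of the two options discussed (names and shapes here are illustrative, not the actual CoreLib change): inside System.Private.CoreLib a method can additionally be marked `[Intrinsic]` so the JIT recognizes it during importation and emits the right nodes directly, while user code outside CoreLib only has `AggressiveInlining` available:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorSketch
{
    // Sketch only: [Intrinsic] is internal to System.Private.CoreLib, so it is
    // elided here; outside CoreLib, AggressiveInlining is the available knob.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper)
        => Vector128.Create(lower, upper); // forwards to the BCL API under discussion
}
```

Treating the method as intrinsic avoids the IL-size and extra-locals cost that an inlined fallback body would otherwise impose on callers.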
I changed the implementation to import to appropriate instructions. Curious: what is the preferable way to do these things, importation or lowering, and what are the advantages of each? One advantage I see in doing it early is that the nodes are passed through other optimizations (if applicable).
That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, like via a delegate) and is due to the fallback using …
Sorry @TamarChristinaArm, I forgot to reply.
Correct. We already optimize …
I believe so. There is an overall problem where arguments are pushed to the stack and retrieved. Related: #35635. Perhaps @CarolEidt might know the exact reasons.
This has changed in this PR. See my updated comments in the PR description.
Looks like genTotalFrameSize in most of the cases returns 48. Again, @CarolEidt, can you confirm why there is a gap?
It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8 byte vectors. It would take some investigation to figure out why.
I think the problem is that …
Thanks @kunalspathak!
Hmm, why is it moving it between register files now though? I would have expected the same …
Yes. Initially I was doing that optimization of generating …
Opened #37429 to track the alignment questions that @TamarChristinaArm had.
Optimizes the following APIs using hardware intrinsics:

- `Vector128.WithLower()`
  WithLower_before.txt
  WithLower_after.txt
- `Vector128.WithUpper()`
  WithUpper_before.txt
  WithUpper_after.txt
  WithUpper_after_jit.txt
- `Vector64.AsDouble()`
  AsDouble_before.txt
  AsDouble_after.txt
- `Vector64.AsInt64()`
  AsInt64_before.txt
  AsInt64_after.txt
- `Vector64.AsUInt64()`
  AsUint64_before.txt
  AsUint64_after.txt
- `Vector128.Create(Vector64, Vector64)`
  Create_before.txt
  Create_after.txt
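For reference, a small usage sketch exercising the APIs listed above (illustrative only; the intrinsic codegen applies when running on ARM64):

```csharp
using System;
using System.Runtime.Intrinsics;

class Demo
{
    static void Main()
    {
        Vector64<ulong> lower = Vector64.Create(1UL);
        Vector64<ulong> upper = Vector64.Create(2UL);

        // Vector128.Create(Vector64, Vector64), then WithLower/WithUpper to swap halves.
        Vector128<ulong> v = Vector128.Create(lower, upper);
        v = v.WithLower(upper).WithUpper(lower);

        // Reinterpreting casts: AsInt64 / AsDouble (bit-pattern preserved, no conversion).
        Vector64<long> asInt64 = lower.AsInt64();
        Vector64<double> asDouble = lower.AsDouble();

        Console.WriteLine(v);
        Console.WriteLine(asInt64);
        Console.WriteLine(asDouble);
    }
}
```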
Update:
After talking to Tanner and Egor, we decided not to optimize the `Insert`+`Extract` combination in the JIT, but to do it in a separate PR when we implement the `InsertSelectedScalar` intrinsic. So currently, we optimize `WithUpper`, `WithLower` and `Create` in managed code. Here is the code we generate after this change:
withlower_after_nojit.txt
withupper_after_nojit.txt
create_after_nojit.txt
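A minimal sketch of what the managed-code path for `WithUpper` might look like, assuming `AdvSimd.InsertScalar` is available for 64-bit element types (the actual CoreLib code may differ in shape and in how the fallback is reached):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ManagedSketch
{
    // Replace the upper 64 bits of `vector` with `value`.
    public static Vector128<ulong> WithUpperSketch(Vector128<ulong> vector, Vector64<ulong> value)
    {
        if (AdvSimd.IsSupported)
        {
            // A single insert (ins Vd.D[1], Vn.D[0]) on ARM64 -- no stack traffic.
            return AdvSimd.InsertScalar(vector, 1, value);
        }
        // Portable fallback: rebuild the vector from its halves.
        return Vector128.Create(vector.GetLower(), value);
    }
}
```

Keeping this in managed code (rather than recognizing `Insert`+`Extract` in the JIT) keeps the JIT change small while still producing the single-instruction form on ARM64.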
Contributes to #33308 and #33496.