Optimise 'Math.CopySign' and 'MathF.CopySign' #26506
john-h-k wants to merge 10 commits into dotnet:master
Conversation
@danmosemsft
Looking at that right now - will make sure to do so
The spill that happens on the non-SSE path is something that can be fixed in the JIT. I'm not sure if, once that is done, the SSE path would still be significantly faster to warrant the extra complexity, as the ANDPS/ORPS instructions have lower throughput compared to their integer counterparts. For benchmarking purposes it would be useful to test the non-SSE version with
Though with
48B80000000000000080 mov rax, 0x8000000000000000
C4E1F96ED0           vmovd xmm2, rax
C5E854C9             vandps xmm1, xmm2, xmm1
C5E855C0             vandnps xmm0, xmm2, xmm0
C5F856C1             vorps xmm0, xmm0, xmm1
which is probably as good as it gets.
Just for reference, here is what Mono generates for:

[MethodImpl(MethodImplOptions.NoInlining)]
public static double Test(double x, double y) => Math.CopySign(x, y);

asm:

0000000000000000 movabsq $0x7fb20121f3e0, %rax
000000000000000a vandps (%rax), %xmm1, %xmm1
000000000000000e movabsq $0x7fb20121f3f0, %rax
0000000000000018 vandps (%rax), %xmm0, %xmm0
000000000000001c vorps %xmm1, %xmm0, %xmm0
0000000000000020 retq
Force-pushed from 859b8d9 to 7b74bec
// flip the sign bit of x and return the new value;
// otherwise, just return x
// Remove the sign from x, and remove everything but the sign from y
// Creating from a 'const long' is better than from the correct 'const float/double'
What was the performance difference between using Vector128.CreateScalarUnsafe(signMask).AsSingle(), which generates:
mov rax, 0x8000000000000000
vmovd xmm2, rax
and using Vector128.CreateScalarUnsafe(-0.0), which generates:
vmovsd xmm0, qword [rip+0x15]
The latter has the load from memory, but smaller codegen, which I believe could be impactful to tight loops.
MSVC, GCC, and Clang all look to prefer the latter scenario...
I tried both, and in the naive-ish simple creation benchmark the first appeared to be marginally faster, so I chose it.
Also, now that ANDNPS is used, there is only one load rather than two.
the JIT doesn't seem to inline this method in its current state
This is probably because the IL for all code paths is sufficiently big to skip the initial heuristics.
What we've done for some of the other code is mark the method as AggressiveInlining and keep any code-paths we know are small enough inline (i.e. generally the intrinsic paths). The software fallback (or other paths which might not be "small enough") are put in their own local method so the JIT can independently determine whether or not to inline using only that method (another explanation of this is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,11).
For example, https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,312.
Is it worth applying it to the method, given the discussion around its use in tight loops?
I would imagine so, the codegen is very small (5 instructions in the intrinsic case and nearly as small in the software fallback case), so the cost of the additional indirection might be greater than the cost of just running the code.
So in this situation, because both pathways are likely profitable to inline and the fallback is also small, just mark the whole method as AggressiveInlining rather than separating it into a separate fallback method?
I think it would be good to let the JIT make its own decision about the software fallback...
The code generated around the DoubleToInt64 bits is still non-ideal (I imagine the ARM codegen is similar, but I'll need to check):
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm1
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm2
L0018: mov rdx, [rsp+0x8]
L001d: mov rcx, 0x8000000000000000
L0027: and rdx, rcx
L002a: mov rcx, 0x7fffffffffffffff
L0034: and rax, rcx
L0037: or rax, rdx
L003a: mov [rsp], rax
L003e: vmovsd xmm0, qword [rsp]
L0043: add rsp, 0x18
L0047: ret

It would also be good to try and get rid of the second constant load here; having the following would be much better:
mov rcx, 0x8000000000000000
and rdx, rcx
- mov rcx, 0x7fffffffffffffff
+ not rcx
and rax, rcx

It looks like the C# compiler optimizes it to be two distinct constants in the case where signMask is declared a constant, and the JIT does the same when it isn't.
👍 for the separate method fallbacks. Currently trying a few things, but am failing to trigger the JIT to generate the second style, unfortunately.
It might be something the JIT doesn't support today, in which case logging a bug should be sufficient.
That would be greatly appreciated and would make validating this change on other boxes simpler 😄. In general, the benchmarks are being checked into the dotnet/performance repo. -- @adamsitnik, do we have any guidance on how to run dotnet/performance benchmarks against local dotnet/coreclr builds? Overall the changes LGTM, just want to test this on a few more boxes (and probably an ARM machine) before committing it.
Testing on sharplab, the JIT doesn't seem to inline this method in its current state, but will with
Not sure where these vanished to, eek
didn't realise this when I wrapped the line :/
The SSE path on x86:
C5FB10442444 vmovsd xmm0, qword ptr [esp+44H]
C5F911442420 vmovupd xmmword ptr [esp+20H], xmm0
C5FB104C243C vmovsd xmm1, qword ptr [esp+3CH]
C5F9114C2410 vmovupd xmmword ptr [esp+10H], xmm1
6800000080 push 0x80000000
6A00 push 0
8D4C2408 lea ecx, bword ptr [esp+08H]
FF1514C08E05 call [System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(long):struct]
C5F9100424 vmovupd xmm0, xmmword ptr [esp]
C5F855442420 vandnps xmm0, xmm0, xmmword ptr [esp+20H]
C5F9104C2410 vmovupd xmm1, xmmword ptr [esp+10H]
C5F0540C24 vandps xmm1, xmm1, xmmword ptr [esp]
C5F856C1 vorps xmm0, xmm0, xmm1
C5FB11442430 vmovsd qword ptr [esp+30H], xmm0
DD442430 fld qword ptr [esp+30H]

It probably should only be enabled on x64.
Sounds good to me. It's worth noting that some of that looks like codegen issues we might want to track... There isn't really any reason why this couldn't be:

vmovq xmm2, qword ptr [address of ulong constant] ; This encoding of movq is supported even on 32-bit
vandps xmm1, xmm2, xmm1
vandnps xmm0, xmm2, xmm0
vorps xmm0, xmm0, xmm1
Force-pushed from 5609a0d to 2c75f1b
Yes, all you need to do is pass the path to the corerun executable via https://github.com/dotnet/performance/tree/master/src/benchmarks/micro#private-runtime-builds. @john-h-k, please let me know if something is not clear or does not work as expected.
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
Thank you for your contribution. As announced in #27549, the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you'd like to continue working on this change, please move it to dotnet/runtime.
(now in the correct repo 🥇)
Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved over their current implementation. The new implementation uses SSE intrinsics, which are faster, and also has a faster intrinsic-free branch.
The SSE pathway doesn't spill to the stack, unlike the others, and is branch-free. The fallback is also branch-free but does spill to the stack (the current implementation both spills and contains a branch). These are faster in every scenario, except for the non-SSE fallback for double precision being marginally (10%) slower on x64 in the Same scenario - however, Windows (and x64) requires SSE2, so it is unlikely this code is ever going to be run on x86. I haven't profiled on ARM as I don't have access to an ARM system.
Benchmark sources here
Scenarios:
Same - every sign is the same (e.g. x == 1f, y == 2f)
Different - every sign is different (e.g. x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different
Single precision (MathF):
Double precision (Math):