Optimise MathF.CopySign and Math.CopySign using SSE intrinsics #35456
tannergooding merged 22 commits into dotnet:master
Conversation
Wouldn't this only apply to the
I rechecked the comment about this, and yep, only the double method was meant to be restricted, which I have now fixed. We determined a double constant had worse codegen on x64, though.
Interesting, LLVM implements it via
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.Helpers.cs
// including NaN, so we operate on the raw bits.
long xbits = BitConverter.DoubleToInt64Bits(x);
long ybits = BitConverter.DoubleToInt64Bits(y);
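The snippet above reads the raw bits of both operands. A full branchless fallback along these lines might look like the following sketch (illustrative, not the exact CoreLib code; the class and constant names are assumptions):

```csharp
using System;

static class CopySignFallback
{
    // Branchless CopySign sketch: magnitude bits from x, sign bit from y.
    // Operating on the raw bits handles every input uniformly, including NaN.
    public static double CopySign(double x, double y)
    {
        long xbits = BitConverter.DoubleToInt64Bits(x);
        long ybits = BitConverter.DoubleToInt64Bits(y);

        // Only the top (sign) bit is set in this mask.
        const long SignMask = unchecked((long)0x8000_0000_0000_0000);

        // Clear x's sign bit, isolate y's sign bit, then combine - no branches.
        return BitConverter.Int64BitsToDouble((xbits & ~SignMask) | (ybits & SignMask));
    }
}
```

Because there is no data-dependent branch, the cost is the same whether the signs match or not, which is why it wins on unpredictable inputs.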
How does the perf of this implementation compare to the original one on x86?
Across the board, the intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in those cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for the software fallback.
I think the question was explicitly with regards to 32-bit x86, rather than x64. The numbers will need to be collected independently.
For random/unpredictable input it is ~175us -> ~50us.
Could you share the before/after disasm for both x86 and x64?
x64 - branched is original, branchless is new
CopySign_Branched(Double, Double)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm0
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm1
L0018: mov rdx, [rsp+0x8]
L001d: xor rdx, rax
L0020: test rdx, rdx
L0023: jge L0040
L0025: mov rdx, 0x8000000000000000
L002f: xor rax, rdx
L0032: mov [rsp], rax
L0036: vmovsd xmm0, qword [rsp]
L003b: add rsp, 0x18
L003f: ret
L0040: add rsp, 0x18
L0044: ret
CopySign_Branchless(Double, Double)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm0
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm1
L0018: mov rdx, [rsp+0x8]
L001d: mov rcx, 0x8000000000000000
L0027: and rdx, rcx
L002a: mov rcx, 0x7fffffffffffffff
L0034: and rax, rcx
L0037: or rax, rdx
L003a: mov [rsp], rax
L003e: vmovsd xmm0, qword [rsp]
L0043: add rsp, 0x18
L0047: ret
CopySign_Branched(Single, Single)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovss [rsp+0x14], xmm0
L000d: mov eax, [rsp+0x14]
L0011: vmovss [rsp+0x10], xmm1
L0017: mov edx, [rsp+0x10]
L001b: xor edx, eax
L001d: test edx, edx
L001f: jge L0035
L0021: xor eax, 0x80000000
L0026: mov [rsp+0xc], eax
L002a: vmovss xmm0, dword [rsp+0xc]
L0030: add rsp, 0x18
L0034: ret
L0035: add rsp, 0x18
L0039: ret
CopySign_Branchless(Single, Single)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovss [rsp+0x14], xmm0
L000d: mov eax, [rsp+0x14]
L0011: vmovss [rsp+0x10], xmm1
L0017: mov edx, [rsp+0x10]
L001b: and edx, 0x80000000
L0021: and eax, 0x7fffffff
L0026: or eax, edx
L0028: mov [rsp+0xc], eax
L002c: vmovss xmm0, dword [rsp+0xc]
L0032: add rsp, 0x18
L0036: ret
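For comparison, the intrinsic path stays in SIMD registers and avoids the stack spills (`vmovss [rsp+...]` / `mov eax, [rsp+...]`) visible in the listings above. A hedged sketch of such an SSE implementation (illustrative, not the exact PR code; the class name and software fallback are assumptions):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class CopySignSse
{
    // Sketch: blend y's sign bit into x with SSE bitwise ops, branch free
    // and without round-tripping through integer registers.
    public static float CopySign(float x, float y)
    {
        if (Sse.IsSupported)
        {
            // -0.0f has only the sign bit set, making it a convenient mask.
            Vector128<float> signMask = Vector128.CreateScalarUnsafe(-0.0f);
            Vector128<float> sign = Sse.And(Vector128.CreateScalarUnsafe(y), signMask);
            // AndNot(a, b) computes ~a & b, i.e. the magnitude bits of x here.
            Vector128<float> magnitude = Sse.AndNot(signMask, Vector128.CreateScalarUnsafe(x));
            return Sse.Or(magnitude, sign).ToScalar();
        }

        // Software fallback: the same idea on the raw integer bits.
        int xbits = BitConverter.SingleToInt32Bits(x);
        int ybits = BitConverter.SingleToInt32Bits(y);
        int mask = unchecked((int)0x8000_0000);
        return BitConverter.Int32BitsToSingle((xbits & ~mask) | (ybits & mask));
    }
}
```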
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128Helpers.cs
src/libraries/System.Private.CoreLib/src/System/Numerics/VectorMath.cs
@john-h-k I think you accidentally rebased master onto your branch rather than vice versa. You can rebase your range of commits on upstream/master and it should go away I think.

Oops - disregard! Just saw the huge list of commits and didn't realize they were yours.

It was an accidental rebase, I'm working with John on Discord to get it fixed 😄
…r perf and add integral constant for fallback
As discussed w tanner
4f73ae9 to 783362d
IIRC at one point I was on the move and made a few of those commits on a phone, hence the split-file committing 😬 If you put a monkey in a cage with a keyboard it would probably be, on average, more successful with git than me. Thank you very much Tanner for helping solve it 😄
I bet you haven't mangled all history and then force pushed to master in someone else's repo (that's how I learned 😸 )
Wow, that hurt to read 😬 I'll take this as a learning experience that merging a branch with itself ends badly 😄
I did not intentionally close this, but damn, I am really proving my incompetence with git
Thanks for the contribution @john-h-k! Perhaps you'd be interested in one of our other up-for-grabs issues?
Oh, I guess there's still dotnet/performance#1298

Got that merged 😄 , so yes definitely
Great, check out issues with that label and let us know if you need help! |
Move of dotnet/coreclr#26506
Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved over their current implementation. The new implementation uses SSE intrinsics, which are faster, and also adds a faster intrinsic-free fallback.
The SSE pathway doesn't spill to the stack, unlike the others, and is branch free. The fallback is also branch free but does spill to the stack (the current implementation spills and contains a branch). These are faster in every scenario, except for the non-SSE fallback for double precision being marginally (10%) slower on x64 in the Same scenario - however, Windows (and x64) requires SSE2, so it is unlikely this fallback code will ever run there. I haven't profiled on ARM as I don't have access to an ARM system.
Benchmark sources here
Scenarios:
Same - every sign is the same (e.g x == 1f, y == 2f)
Different - every sign is different (e.g x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different
Methods:
Standard - current impls
John - impls with branches removed (still non-intrinsic)
John_intrinsic - SSE impls
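The Random scenario matters because unpredictable signs defeat branch prediction, which is exactly where the branched implementation loses. A sketch of how such inputs might be generated (an assumption for illustration; the actual benchmark sources are linked above):

```csharp
using System;

// Hypothetical input setup for the "Random" scenario:
// x is fixed, and each y's sign is chosen at random, so the sign-comparison
// branch in the original implementation is unpredictable.
var rng = new Random(42);
double x = 1.0;
double[] ys = new double[1000];
for (int i = 0; i < ys.Length; i++)
{
    ys[i] = rng.Next(2) == 0 ? 2.0 : -2.0; // randomly same or different sign vs x
}
// Each benchmark iteration would then call Math.CopySign(x, ys[i]) over the array.
```

The Same and Alternating scenarios are the branch predictor's best cases, which is why the branched version is competitive only there.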
Single precision (MathF):
Double precision (Math):
Across the board, intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in these cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for software fallback.
Due to issues with x86 intrinsic codegen, it was decided to limit the intrinsic path to x64.
Microbenchmark PR at dotnet/performance#1298