Optimise MathF.CopySign and Math.CopySign using SSE intrinsics #35456
tannergooding merged 22 commits into dotnet:master
Conversation
Wouldn't this only apply to the
I rechecked the comment about this, and yep, only the double method was meant to be restricted, which I have now fixed. We determined a double constant had worse codegen on x64, though.
Interesting, LLVM implements it via
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.Helpers.cs
// including NaN, so we operate on the raw bits.
long xbits = BitConverter.DoubleToInt64Bits(x);
long ybits = BitConverter.DoubleToInt64Bits(y);
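The snippet above reads the raw bits of both operands. A full branchless fallback along these lines might look like the following sketch (illustrative, not the exact CoreLib code; the class and constant names are assumptions):

```csharp
using System;

static class CopySignFallback
{
    // Branchless CopySign sketch: magnitude bits from x, sign bit from y.
    // Operating on the raw bits handles every input uniformly, including NaN.
    public static double CopySign(double x, double y)
    {
        long xbits = BitConverter.DoubleToInt64Bits(x);
        long ybits = BitConverter.DoubleToInt64Bits(y);

        // Only the top (sign) bit is set in this mask.
        const long SignMask = unchecked((long)0x8000_0000_0000_0000);

        // Clear x's sign bit, isolate y's sign bit, then combine - no branches.
        return BitConverter.Int64BitsToDouble((xbits & ~SignMask) | (ybits & SignMask));
    }
}
```

Because there is no data-dependent branch, the cost is the same whether the signs match or not, which is why it wins on unpredictable inputs.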
How does the perf of this implementation compare to the original one on x86?
Across the board, the intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in those cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for the software fallback.
I think the question was explicitly with regards to 32-bit x86, rather than x64. The numbers will need to be collected independently.
For random/unpredictable input it is ~175us -> ~50us.
Could you share the before/after disasm for both x86 and x64?
x64 - branched is original, branchless is new
CopySign_Branched(Double, Double)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm0
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm1
L0018: mov rdx, [rsp+0x8]
L001d: xor rdx, rax
L0020: test rdx, rdx
L0023: jge L0040
L0025: mov rdx, 0x8000000000000000
L002f: xor rax, rdx
L0032: mov [rsp], rax
L0036: vmovsd xmm0, qword [rsp]
L003b: add rsp, 0x18
L003f: ret
L0040: add rsp, 0x18
L0044: ret
CopySign_Branchless(Double, Double)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm0
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm1
L0018: mov rdx, [rsp+0x8]
L001d: mov rcx, 0x8000000000000000
L0027: and rdx, rcx
L002a: mov rcx, 0x7fffffffffffffff
L0034: and rax, rcx
L0037: or rax, rdx
L003a: mov [rsp], rax
L003e: vmovsd xmm0, qword [rsp]
L0043: add rsp, 0x18
L0047: ret
CopySign_Branched(Single, Single)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovss [rsp+0x14], xmm0
L000d: mov eax, [rsp+0x14]
L0011: vmovss [rsp+0x10], xmm1
L0017: mov edx, [rsp+0x10]
L001b: xor edx, eax
L001d: test edx, edx
L001f: jge L0035
L0021: xor eax, 0x80000000
L0026: mov [rsp+0xc], eax
L002a: vmovss xmm0, dword [rsp+0xc]
L0030: add rsp, 0x18
L0034: ret
L0035: add rsp, 0x18
L0039: ret
CopySign_Branchless(Single, Single)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovss [rsp+0x14], xmm0
L000d: mov eax, [rsp+0x14]
L0011: vmovss [rsp+0x10], xmm1
L0017: mov edx, [rsp+0x10]
L001b: and edx, 0x80000000
L0021: and eax, 0x7fffffff
L0026: or eax, edx
L0028: mov [rsp+0xc], eax
L002c: vmovss xmm0, dword [rsp+0xc]
L0032: add rsp, 0x18
L0036: ret
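For comparison, the intrinsic path stays in SIMD registers and avoids the stack spills (`vmovss [rsp+...]` / `mov eax, [rsp+...]`) visible in the listings above. A hedged sketch of such an SSE implementation (illustrative, not the exact PR code; the class name and software fallback are assumptions):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class CopySignSse
{
    // Sketch: blend y's sign bit into x with SSE bitwise ops, branch free
    // and without round-tripping through integer registers.
    public static float CopySign(float x, float y)
    {
        if (Sse.IsSupported)
        {
            // -0.0f has only the sign bit set, making it a convenient mask.
            Vector128<float> signMask = Vector128.CreateScalarUnsafe(-0.0f);
            Vector128<float> sign = Sse.And(Vector128.CreateScalarUnsafe(y), signMask);
            // AndNot(a, b) computes ~a & b, i.e. the magnitude bits of x here.
            Vector128<float> magnitude = Sse.AndNot(signMask, Vector128.CreateScalarUnsafe(x));
            return Sse.Or(magnitude, sign).ToScalar();
        }

        // Software fallback: the same idea on the raw integer bits.
        int xbits = BitConverter.SingleToInt32Bits(x);
        int ybits = BitConverter.SingleToInt32Bits(y);
        int mask = unchecked((int)0x8000_0000);
        return BitConverter.Int32BitsToSingle((xbits & ~mask) | (ybits & mask));
    }
}
```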
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128Helpers.cs
src/libraries/System.Private.CoreLib/src/System/Numerics/VectorMath.cs
@john-h-k I think you accidentally rebased master onto your branch rather than vice versa. You can rebase your range of commits on upstream/master and it should go away I think.

Oops - disregard! Just saw the huge list of commits and didn't realize they were yours.

It was an accidental rebase, I'm working with John on Discord to get it fixed 😄
…r perf and add integral constant for fallback
As discussed w tanner
4f73ae9 to 783362d
IIRC at one point I was on the move and made a few of those commits on a phone, hence the split-file committing 😬 If you put a monkey in a cage with a keyboard it would probably be, on average, more successful with git than me. Thank you very much Tanner for helping solve it 😄
I bet you haven't mangled all history and then force pushed to master in someone else's repo (that's how I learned 😸 )
Wow, that hurt to read 😬 I'll take this as a learning experience that merging a branch with itself ends badly 😄
I did not intentionally close this, but damn, I am really proving my incompetence with git
Thanks for the contribution @john-h-k! Perhaps you'd be interested in one of our other up-for-grabs issues?
Oh, I guess there's still dotnet/performance#1298

Got that merged 😄 , so yes definitely
Great, check out issues with that label and let us know if you need help! |
Move of dotnet/coreclr#26506
Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved over their current implementation. The new implementation uses SSE intrinsics, which are faster, and also adds a faster intrinsic-free fallback.
The SSE pathway doesn't spill to the stack, unlike the others, and is branch free. The fallback is also branch free but does spill to the stack (the current implementation spills and contains a branch). These are faster in every scenario, except for the non-SSE fallback for double precision being marginally (10%) slower on x64 in the Same scenario - however, Windows (and x64) requires SSE2, so it is unlikely this fallback code will ever run there. I haven't profiled on ARM as I don't have access to an ARM system.
Benchmark sources here
Scenarios:
Same - every sign is the same (e.g x == 1f, y == 2f)
Different - every sign is different (e.g x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different
Methods:
Standard - current impls
John - impls with branches removed (still non-intrinsic)
John_intrinsic - SSE impls
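The Random scenario matters because unpredictable signs defeat branch prediction, which is exactly where the branched implementation loses. A sketch of how such inputs might be generated (an assumption for illustration; the actual benchmark sources are linked above):

```csharp
using System;

// Hypothetical input setup for the "Random" scenario:
// x is fixed, and each y's sign is chosen at random, so the sign-comparison
// branch in the original implementation is unpredictable.
var rng = new Random(42);
double x = 1.0;
double[] ys = new double[1000];
for (int i = 0; i < ys.Length; i++)
{
    ys[i] = rng.Next(2) == 0 ? 2.0 : -2.0; // randomly same or different sign vs x
}
// Each benchmark iteration would then call Math.CopySign(x, ys[i]) over the array.
```

The Same and Alternating scenarios are the branch predictor's best cases, which is why the branched version is competitive only there.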
Single precision (MathF):
Double precision (Math):
Across the board, intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in these cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for software fallback.
Due to issues with x86 intrinsic codegen, it was decided to limit the intrinsic path to x64.
Microbenchmark PR at dotnet/performance#1298