Skip to content

Optimise MathF.CopySign and Math.CopySign using SSE intrinsics#35456

Merged
tannergooding merged 22 commits intodotnet:masterfrom
john-h-k:optimise-copysign
Jun 9, 2020
Merged

Optimise MathF.CopySign and Math.CopySign using SSE intrinsics#35456
tannergooding merged 22 commits intodotnet:masterfrom
john-h-k:optimise-copysign

Conversation

@john-h-k
Copy link
Contributor

@john-h-k john-h-k commented Apr 25, 2020

Move of dotnet/coreclr#26506

Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved from their current implementation. The new implementation uses SSE intrinsics which are faster, as well as having a faster intrinsic-free branch.

The SSE pathway doesn't spill to the stack, unlike the others, and is branch free. The fallback is also branch free but does spill to the stack (the current implementation spills and contains a branch). These are faster in every scenario, except for the non-SSE fallback for double precision being marginally (10%) slower on x64 in the Same scenario - however, windows (and x64) requires SSE2 so it is unlikely this code is ever going to be run on x86. I haven't profiled on ARM as I don't have access to an ARM system.

Benchmark sources here

Scenarios:
Same - every sign is the same (e.g x == 1f, y == 2f)
Different - every sign is different (e.g x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different

Methods:
Standard - current impls
John - impls with branches removed (still non-intrinsic)
John_intrinsic - SSE impls

Single precision (MathF):

Method Scenario Mean Error StdDev
Standard Random 171.66 us 0.5269 us 0.4929 us
John Random 47.51 us 0.1071 us 0.1002 us
John_Intrinsic Random 39.08 us 0.0543 us 0.0508 us
Standard Same 43.31 us 0.0900 us 0.0842 us
John Same 47.53 us 0.1130 us 0.1057 us
John_Intrinsic Same 39.04 us 0.0369 us 0.0288 us
Standard Different 54.78 us 0.1070 us 0.1001 us
John Different 47.19 us 0.1097 us 0.1027 us
John_Intrinsic Different 39.06 us 0.0556 us 0.0493 us
Standard Alternating 48.75 us 0.1244 us 0.1163 us
John Alternating 47.45 us 0.0728 us 0.0645 us
John_Intrinsic Alternating 39.10 us 0.0771 us 0.0722 us

Double precision (Math):

Method Scenario Mean Error StdDev
Standard Random 179.57 us 0.4397 us 0.4113 us
John Random 57.50 us 0.0495 us 0.0386 us
John_Intrinsic Random 43.23 us 0.0243 us 0.0189 us
Standard Same 48.83 us 0.1420 us 0.1328 us
John Same 57.39 us 0.0371 us 0.0328 us
John_Intrinsic Same 43.35 us 0.1234 us 0.1155 us
Standard Different 58.71 us 0.0581 us 0.0515 us
John Different 57.44 us 0.0438 us 0.0365 us
John_Intrinsic Different 43.24 us 0.0465 us 0.0388 us
Standard Alternating 53.20 us 0.1711 us 0.1517 us
John Alternating 57.44 us 0.0855 us 0.0714 us
John_Intrinsic Alternating 43.23 us 0.0263 us 0.0233 us

Across the board, intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in these cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for software fallback

Due to issues with x86 intrinsic codegen, it was decided to limit the intrinsic path to x64

Microbenchmark PR at dotnet/performance#1298

@john-h-k
Copy link
Contributor Author

cc @tannergooding

@saucecontrol
Copy link
Member

Due to issues with x86 intrinsic codegen, it was decided to limit the intrinsic path to x64

Wouldn't this only apply to the double overload on x86, because of the 64-bit constant? And wouldn't a double constant (-0.0) solve the issue?

@john-h-k
Copy link
Contributor Author

Wouldn't this only apply to the double overload on x86, because of the 64-bit constant? And wouldn't a double constant (-0.0) solve the issue?

I rechecked the comment about this, and yep, only the double method was meant to be restricted, which i have now fixed. We determined a double constant had worse codegen on x64 tho

@EgorBo
Copy link
Member

EgorBo commented Apr 26, 2020

Interesting, LLVM implements it via ternarylogic on avx-512 hw 🙂 https://godbolt.org/z/a2sdpF

// including NaN, so we operate on the raw bits.
long xbits = BitConverter.DoubleToInt64Bits(x);
long ybits = BitConverter.DoubleToInt64Bits(y);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the perf of this implementation compare to the original one on x86?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Across the board, intrinsic is consistently fastest, and the branchless impl is fastest on all except Scenarios.Same for float, and Scenarios.Same and Scenarios.Alternating for double. However, it is <10us slower in these cases, and still >3x faster in Scenarios.Random, which is probably the most common scenario, so at the moment this PR uses the branchless impl for software fallback

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question was explicitly with regards to 32-bit x86, rather than x64. The numbers will need to be collected independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for random/unpredictable it is ~175us -> ~50us

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share the before/after disasm for both x86 and x64?

Copy link
Contributor Author

@john-h-k john-h-k Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x64 - branched is original, branchless is new

CopySign_Branched(Double, Double)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovsd [rsp+0x10], xmm0
    L000d: mov rax, [rsp+0x10]
    L0012: vmovsd [rsp+0x8], xmm1
    L0018: mov rdx, [rsp+0x8]
    L001d: xor rdx, rax
    L0020: test rdx, rdx
    L0023: jge L0040
    L0025: mov rdx, 0x8000000000000000
    L002f: xor rax, rdx
    L0032: mov [rsp], rax
    L0036: vmovsd xmm0, qword [rsp]
    L003b: add rsp, 0x18
    L003f: ret
    L0040: add rsp, 0x18
    L0044: ret


CopySign_Branchless(Double, Double)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovsd [rsp+0x10], xmm0
    L000d: mov rax, [rsp+0x10]
    L0012: vmovsd [rsp+0x8], xmm1
    L0018: mov rdx, [rsp+0x8]
    L001d: mov rcx, 0x8000000000000000
    L0027: and rdx, rcx
    L002a: mov rcx, 0x7fffffffffffffff
    L0034: and rax, rcx
    L0037: or rax, rdx
    L003a: mov [rsp], rax
    L003e: vmovsd xmm0, qword [rsp]
    L0043: add rsp, 0x18
    L0047: ret
CopySign_Branched(Single, Single)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovss [rsp+0x14], xmm0
    L000d: mov eax, [rsp+0x14]
    L0011: vmovss [rsp+0x10], xmm1
    L0017: mov edx, [rsp+0x10]
    L001b: xor edx, eax
    L001d: test edx, edx
    L001f: jge L0035
    L0021: xor eax, 0x80000000
    L0026: mov [rsp+0xc], eax
    L002a: vmovss xmm0, dword [rsp+0xc]
    L0030: add rsp, 0x18
    L0034: ret
    L0035: add rsp, 0x18
    L0039: ret

CopySign_Branchless(Single, Single)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovss [rsp+0x14], xmm0
    L000d: mov eax, [rsp+0x14]
    L0011: vmovss [rsp+0x10], xmm1
    L0017: mov edx, [rsp+0x10]
    L001b: and edx, 0x80000000
    L0021: and eax, 0x7fffffff
    L0026: or eax, edx
    L0028: mov [rsp+0xc], eax
    L002c: vmovss xmm0, dword [rsp+0xc]
    L0032: add rsp, 0x18
    L0036: ret

@danmoseley
Copy link
Member

@john-h-k I think you accidentally rebased master onto your branch rather than vice versa. You can rebase your range of commits on upstream/master and it should go away I think.

@danmoseley
Copy link
Member

Oops - disregard! Just saw the huge list of commits and didn't realize they were yours.

@tannergooding
Copy link
Member

It was an accidental rebase, I'm working with John on Discord to get it fixed 😄

@john-h-k john-h-k force-pushed the optimise-copysign branch from 4f73ae9 to 783362d Compare June 4, 2020 22:38
@john-h-k
Copy link
Contributor Author

john-h-k commented Jun 4, 2020

Oops - disregard! Just saw the huge list of commits and didn't realize they were yours.

iirc at one point i was on the move and made a few of those commits on a phone hence the split file committing 😬

if you put a monkey in a cage with a keyboard it would probably be, on average, more successful with git than me

thank you very much tanner for helping solve it 😄

@danmoseley
Copy link
Member

@john-h-k

if you put a monkey in a cage with a keyboard it would probably be, on average, more successful with git than me

I bet you haven't mangled all history and then force pushed to master in someone else's repo (that's how I learned 😸 )

@john-h-k
Copy link
Contributor Author

john-h-k commented Jun 5, 2020

I bet you haven't mangled all history and then force pushed to master in someone else's repo (that's how I learned 😸 )

wow that hurt to read 😬

i'll take this as a learning experience that merging a branch with itself ends badly 😄

@john-h-k john-h-k closed this Jun 5, 2020
@john-h-k john-h-k reopened this Jun 5, 2020
@john-h-k
Copy link
Contributor Author

john-h-k commented Jun 5, 2020

i did not intentionally close this but damn i am really proving my incompotence with git

@tannergooding tannergooding merged commit b8e1a49 into dotnet:master Jun 9, 2020
@danmoseley
Copy link
Member

Thanks for the contribution @john-h-k ! Perhaps you'd be interested in one of our other up-for-grabs issues?

@danmoseley
Copy link
Member

Oh, I guess ther'es still dotnet/performance#1298

@john-h-k
Copy link
Contributor Author

Perhaps you'd be interested in one of our other up-for-grabs issues?
Oh, I guess ther'es still dotnet/performance#1298

Got that merged 😄 , so yes definitely

@danmoseley
Copy link
Member

Great, check out issues with that label and let us know if you need help!

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants