Optimise 'Math.CopySign' and 'MathF.CopySign' #26506
john-h-k wants to merge 10 commits into dotnet:master
Conversation
@danmosemsft
Looking at that right now - will make sure to do so
The spill that happens on the non-SSE path is something that can be fixed in the JIT. I'm not sure if, once that is done, the SSE path would still be significantly faster to warrant the extra complexity, as the ANDPS/ORPS instructions have lower throughput compared to their integer counterparts. For benchmarking purposes it would be useful to test the non-SSE version with
Though with
48B80000000000000080 mov rax, 0x8000000000000000
C4E1F96ED0           vmovd xmm2, rax
C5E854C9             vandps xmm1, xmm2, xmm1
C5E855C0             vandnps xmm0, xmm2, xmm0
C5F856C1             vorps xmm0, xmm0, xmm1
which is probably as good as it gets.
Just for reference, here is what Mono generates for:

[MethodImpl(MethodImplOptions.NoInlining)]
public static double Test(double x, double y) => Math.CopySign(x, y);

asm:

0000000000000000 movabsq $0x7fb20121f3e0, %rax
000000000000000a vandps (%rax), %xmm1, %xmm1
000000000000000e movabsq $0x7fb20121f3f0, %rax
0000000000000018 vandps (%rax), %xmm0, %xmm0
000000000000001c vorps %xmm1, %xmm0, %xmm0
0000000000000020 retq
Force-pushed from 859b8d9 to 7b74bec
// flip the sign bit of x and return the new value;
// otherwise, just return x
// Remove the sign from x, and remove everything but the sign from y
// Creating from a 'const long' is better than from the correct 'const float/double'
What was the performance difference between using Vector128.CreateScalarUnsafe(signMask).AsSingle(), which generates:
mov rax, 0x8000000000000000
vmovd xmm2, rax
and using Vector128.CreateScalarUnsafe(-0.0), which generates:
vmovsd xmm0, qword [rip+0x15]
The latter has the load from memory, but smaller codegen, which I believe could be impactful to tight loops.
MSVC, GCC, and Clang all look to prefer the latter scenario...
I tried both, and in the naive-ish simple creation benchmark the first appeared to be marginally faster, so I chose it.
Also, now that ANDNPS is used, there is only one load rather than two.
the JIT doesn't seem to inline this method in its current state
This is probably because the IL for all code paths is sufficiently big to skip the initial heuristics.
What we've done for some of the other code is mark the method as AggressiveInlining and keep any code-paths we know are small enough inline (i.e. generally the intrinsic paths). The software fallback (or other paths which might not be "small enough") are put in their own local method so the JIT can independently determine whether or not to inline using only that method (another explanation of this is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,11).
For example, https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/Vector128.cs,312.
Is it worth applying it to the method, given the discussion around its use in tight loops?
I would imagine so, the codegen is very small (5 instructions in the intrinsic case and nearly as small in the software fallback case), so the cost of the additional indirection might be greater than the cost of just running the code.
So in this situation, because both pathways are likely profitable to inline and the fallback is also small, just mark the whole method as AggressiveInlining rather than separating it into a separate fallback method?
I think it would be good to let the JIT make its own decision about the software fallback...
The code generated around the DoubleToInt64 bits is still non-ideal (I imagine the ARM codegen is similar, but I'll need to check):
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: vmovsd [rsp+0x10], xmm1
L000d: mov rax, [rsp+0x10]
L0012: vmovsd [rsp+0x8], xmm2
L0018: mov rdx, [rsp+0x8]
L001d: mov rcx, 0x8000000000000000
L0027: and rdx, rcx
L002a: mov rcx, 0x7fffffffffffffff
L0034: and rax, rcx
L0037: or rax, rdx
L003a: mov [rsp], rax
L003e: vmovsd xmm0, qword [rsp]
L0043: add rsp, 0x18
L0047: ret

It would also be good to try and get rid of the second constant load here; having the following would be much better:
mov rcx, 0x8000000000000000
and rdx, rcx
- mov rcx, 0x7fffffffffffffff
+ not rcx
and rax, rcx

It looks like the C# compiler optimizes it to be two distinct constants in the case where signMask is declared a constant, and the JIT does the same when it isn't.
👍 for the separate method fallbacks. Currently trying a few things, but am failing to trigger the JIT to generate the second style, unfortunately.
It might be something the JIT doesn't support today, in which case logging a bug should be sufficient.
That would be greatly appreciated and would make validating this change on other boxes simpler 😄. In general, the benchmarks are being checked into the dotnet/performance repo. -- @adamsitnik, do we have any guidance on how to run dotnet/performance benchmarks against local dotnet/coreclr builds? Overall the changes LGTM, just want to test this on a few more boxes (and probably an ARM machine) before committing it.
Testing on sharplab, the JIT doesn't seem to inline this method in its current state, but will with
Not sure where these vanished to, eek
didn't realise this when I wrapped the line :/
The SSE path on x86:
C5FB10442444 vmovsd xmm0, qword ptr [esp+44H]
C5F911442420 vmovupd xmmword ptr [esp+20H], xmm0
C5FB104C243C vmovsd xmm1, qword ptr [esp+3CH]
C5F9114C2410 vmovupd xmmword ptr [esp+10H], xmm1
6800000080 push 0x80000000
6A00 push 0
8D4C2408 lea ecx, bword ptr [esp+08H]
FF1514C08E05 call [System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(long):struct]
C5F9100424 vmovupd xmm0, xmmword ptr [esp]
C5F855442420 vandnps xmm0, xmm0, xmmword ptr [esp+20H]
C5F9104C2410 vmovupd xmm1, xmmword ptr [esp+10H]
C5F0540C24 vandps xmm1, xmm1, xmmword ptr [esp]
C5F856C1 vorps xmm0, xmm0, xmm1
C5FB11442430 vmovsd qword ptr [esp+30H], xmm0
DD442430 fld qword ptr [esp+30H]

It probably should only be enabled on x64.
Sounds good to me. It's worth noting that some of that looks like codegen issues we might want to track... There isn't really any reason why this couldn't be:

vmovq xmm2, qword ptr [address of ulong constant] ; This encoding of movq is supported even on 32-bit
vandps xmm1, xmm2, xmm1
vandnps xmm0, xmm2, xmm0
vorps xmm0, xmm0, xmm1
Force-pushed from 5609a0d to 2c75f1b
Yes, all you need to do is pass the path to the corerun executable via https://github.com/dotnet/performance/tree/master/src/benchmarks/micro#private-runtime-builds. @john-h-k, please let me know if something is not clear or does not work as expected.
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
Thank you for your contribution. As announced in #27549, the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you'd like to continue working on this change, please move it to dotnet/runtime.
(now in the correct repo 🥇)
Optimise 'Math.CopySign' and 'MathF.CopySign'. Both of these methods can be improved over their current implementation. The new implementation uses SSE intrinsics, which are faster, and also has a faster intrinsic-free branch.
The SSE pathway doesn't spill to the stack, unlike the others, and is branch-free. The fallback is also branch-free but does spill to the stack (the current implementation both spills and contains a branch). These are faster in every scenario, except for the non-SSE fallback for double precision being marginally (10%) slower on x64 in the Same scenario - however, Windows (and x64) requires SSE2, so it is unlikely this code is ever going to be run on x86. I haven't profiled on ARM as I don't have access to an ARM system.
Benchmark sources here
Scenarios:
Same - every sign is the same (e.g. x == 1f, y == 2f)
Different - every sign is different (e.g. x == 1f, y == -2f)
Alternating - alternates between same and different
Random - randomly same or different
Single precision (MathF):
Double precision (Math):