Speed up interface checking and casting by benaadams · Pull Request #49257 · dotnet/runtime

benaadams · 2021-03-06T04:49:23Z

16% - 38% faster* for IsInstanceOfInterface & ChkCastInterface

Less arguments in addressing modes [rdx + 8] vs [rdx + r8*8 + 8]; for lower latency and greater CPU port availability for more instruction parallelism (*see "Intel Optimization Reference Manual" sections "3.5.1.2 Using LEA" and "3.5.1.6 Address Calculations")
- 2 operand addressing has latency of 1 clock and 2 ports available so a reciprocal throughput of 0.5 clocks
- 3 operand addressing has latency of 3 clocks and 1 port available so a reciprocal throughput of 1 clock
Reduce branches, conditional jumps reduced from 8 to 5
4 decs exchanged for 1 add with less data depenency

-       xor      r8d, r8d
+       cmp      rcx, 4
+       jge      SHORT G_M5541_IG06                     >---\
        align    [0 bytes]                                  |
-                       ;; bbWeight=0.50 PerfScore 3.50     |
+                       ;; bbWeight=0.50 PerfScore 4.00     |
                                                            |
-G_M57878_IG04:                             <---\           |
-       cmp      qword ptr [rdx+8*r8], rsi      |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+8], rsi    |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+16], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+24], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       je       SHORT G_M57878_IG05            |           |
-       add      r8, 4                          |           |
-       jmp      SHORT G_M57878_IG04        >---/           |
-                       ;; bbWeight=4    PerfScore 77.00    |
+G_M5541_IG04:                              <---\           |
+       lea      r8, [rdx+32]                   |           |
+       cmp      qword ptr [rdx], rsi           |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+8], rsi         |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+16], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+24], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       mov      rdx, r8                        |           |
+       add      rcx, -4                        |           |
+       cmp      rcx, 4                         |           |
+       jge      SHORT G_M5541_IG04         >---/           |
+                       ;; bbWeight=4    PerfScore 57.00    |
+                                                           |
+G_M5541_IG05:                                              |
+       test     rcx, rcx                                   |
+       je       SHORT G_M5541_IG07                         |
+                       ;; bbWeight=0.50 PerfScore 0.62     |
+                                                           |
+G_M5541_IG06:                              <---\       <---/
+       lea      r8, [rdx+8]                    |
+       cmp      qword ptr [rdx], rsi           |
+       je       SHORT G_M5541_IG08             |
+       mov      rdx, r8                        |
        dec      rcx                            |
+       test     rcx, rcx                       |
+       jg       SHORT G_M5541_IG06         >---/
+                       ;; bbWeight=4    PerfScore 21.00

src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs

EgorBo · 2021-03-06T10:06:52Z

@AndyAyersMS btw, the CastHelpers (also SpanHelpers) are heavily optimized by hands for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization - otherwise we can re-order them because of some loop during start up and mess up other cases.

benaadams · 2021-03-06T19:08:07Z

Updated summary

EgorBo · 2021-03-06T19:58:41Z

Can you run a benchmark, e.g. something like this (for IsInstanceOfClass ): https://gist.github.com/EgorBo/8766a778fa2522fad4e916c567a2b5d7

benaadams · 2021-03-07T02:23:12Z

Can you run a benchmark

Have one but its been running for hours (too many variants); will have to try something shorter

src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs

benaadams · 2021-03-07T05:26:26Z

Can you run a benchmark

Looks good after fixing the condition @jkotas pointed out #49257 (comment)

Interface cast benchmark gist

Method	Implementation	Mean	branch	Ratio
IsInterface1	Interfaces1	2.166 ns	current	1.000
IsInterface1	Interfaces1	1.629 ns	PR	0.752
IsNotImplemented	Interfaces1	2.166 ns	current	1.000
IsNotImplemented	Interfaces1	1.950 ns	PR	0.900

IsInterface1	Interfaces6	2.155 ns	current	1.000
IsInterface1	Interfaces6	1.936 ns	PR	0.898
IsInterface2	Interfaces6	2.161 ns	current	1.000
IsInterface2	Interfaces6	1.908 ns	PR	0.883
IsInterface3	Interfaces6	2.188 ns	current	1.000
IsInterface3	Interfaces6	1.672 ns	PR	0.764
IsInterface4	Interfaces6	2.346 ns	current	1.000
IsInterface4	Interfaces6	1.965 ns	PR	0.838
IsInterface5	Interfaces6	3.122 ns	current	1.000
IsInterface5	Interfaces6	2.173 ns	PR	0.696
IsInterface6	Interfaces6	3.472 ns	current	1.000
IsInterface6	Interfaces6	2.687 ns	PR	0.774
IsNotImplemented	Interfaces6	3.644 ns	current	1.000
IsNotImplemented	Interfaces6	3.071 ns	PR	0.843

IsInterface1	Interfaces16	2.154 ns	current	1.000
IsInterface1	Interfaces16	1.958 ns	PR	0.909
IsInterface2	Interfaces16	1.938 ns	current	1.000
IsInterface2	Interfaces16	1.950 ns	PR	1.006
IsInterface3	Interfaces16	2.185 ns	current	1.000
IsInterface3	Interfaces16	1.661 ns	PR	0.760
IsInterface4	Interfaces16	2.326 ns	current	1.000
IsInterface4	Interfaces16	1.655 ns	PR	0.712
IsInterface5	Interfaces16	3.201 ns	current	1.000
IsInterface5	Interfaces16	2.349 ns	PR	0.734
IsInterface6	Interfaces16	3.186 ns	current	1.000
IsInterface6	Interfaces16	2.376 ns	PR	0.746
IsInterface7	Interfaces16	3.360 ns	current	1.000
IsInterface7	Interfaces16	2.578 ns	PR	0.767
IsInterface8	Interfaces16	3.764 ns	current	1.000
IsInterface8	Interfaces16	2.621 ns	PR	0.696
IsInterface9	Interfaces16	4.291 ns	current	1.000
IsInterface9	Interfaces16	3.091 ns	PR	0.720
IsInterface10	Interfaces16	4.866 ns	current	1.000
IsInterface10	Interfaces16	3.021 ns	PR	0.621
IsInterface11	Interfaces16	4.600 ns	current	1.000
IsInterface11	Interfaces16	3.439 ns	PR	0.748
IsInterface12	Interfaces16	4.908 ns	current	1.000
IsInterface12	Interfaces16	3.376 ns	PR	0.688
IsInterface13	Interfaces16	5.862 ns	current	1.000
IsInterface13	Interfaces16	3.720 ns	PR	0.635
IsInterface14	Interfaces16	6.111 ns	current	1.000
IsInterface14	Interfaces16	3.682 ns	PR	0.603
IsInterface15	Interfaces16	5.902 ns	current	1.000
IsInterface15	Interfaces16	3.955 ns	PR	0.670
IsInterface16	Interfaces16	6.406 ns	current	1.000
IsInterface16	Interfaces16	3.947 ns	PR	0.616
IsNotImplemented	Interfaces16	6.689 ns	current	1.000
IsNotImplemented	Interfaces16	5.412 ns	PR	0.809

src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs

jkotas

Looks great to me. @VSadov Could you please take a look as well?

MaherJendoubi · 2021-03-07T17:34:13Z

You can delete unnecessary usings in the System.Runtime.CompilerServices.CastHelpers class :
using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;

benaadams · 2021-03-07T17:46:01Z

You can delete unnecessary usings ...

Done

VSadov · 2021-03-07T18:03:06Z

src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs

-                    if (interfaceMap[i + 3] == toTypeHnd)
+                    // If not enough for unrolled, jmp straight to small loop
+                    // as we already know there is one or more interfaces so don't need to check again.
+                    goto few;


If interface table is allocated as 4 elements aligned and zero-filled, we could remove the "few" case entirely. At least in Chk case which is expected to succeed.
That would come with size increase though, so does not seem worth it.

VSadov

LGTM, Nice!!!

AndyAyersMS · 2021-03-07T23:57:48Z

the CastHelpers (also SpanHelpers) are heavily optimized by hands for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization

These methods get prejitted, right? Once we have PGO data for prejitted methods we should look and see if we're happy with the codegen we get.

VSadov · 2021-03-08T02:54:39Z

@AndyAyersMS - Yes, these methods are R2R (so no jit at startup) and may rejit later.

They are not AggressiveOptimization. We know that will make them faster by avoiding tiering indirection, but we will have to jit at startup and we want to avoid that.

ghost added the area-VM-coreclr label Mar 6, 2021

EgorBo reviewed Mar 6, 2021

View reviewed changes