Speed up interface checking and casting#49257
Conversation
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
|
@AndyAyersMS btw, the CastHelpers (also SpanHelpers) are heavily optimized by hands for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization - otherwise we can re-order them because of some loop during start up and mess up other cases. |
fb8a3de to
855430a
Compare
|
Updated summary |
|
Can you run a benchmark, e.g. something like this (for |
Have one but its been running for hours (too many variants); will have to try something shorter |
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
Looks good after fixing the condition @jkotas pointed out #49257 (comment) Interface cast benchmark gist
|
src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs
Outdated
Show resolved
Hide resolved
|
You can delete unnecessary usings in the System.Runtime.CompilerServices.CastHelpers class : |
Done |
| if (interfaceMap[i + 3] == toTypeHnd) | ||
| // If not enough for unrolled, jmp straight to small loop | ||
| // as we already know there is one or more interfaces so don't need to check again. | ||
| goto few; |
There was a problem hiding this comment.
If interface table is allocated as 4 elements aligned and zero-filled, we could remove the "few" case entirely. At least in Chk case which is expected to succeed.
That would come with size increase though, so does not seem worth it.
These methods get prejitted, right? Once we have PGO data for prejitted methods we should look and see if we're happy with the codegen we get. |
|
@AndyAyersMS - Yes, these methods are R2R (so no jit at startup) and may rejit later. They are not |
16% - 38% faster* for
IsInstanceOfInterface&ChkCastInterface[rdx + 8]vs[rdx + r8*8 + 8]; for lower latency and greater CPU port availability for more instruction parallelism (*see "Intel Optimization Reference Manual" sections "3.5.1.2 Using LEA" and "3.5.1.6 Address Calculations")decs exchanged for 1addwith less data depenency*Results #49257 (comment)
/cc @VSadov