WIP: Peephole optimize redundant and shuffle moves by AndyAyersMS · Pull Request #21959 · dotnet/coreclr

AndyAyersMS · 2019-01-11T17:52:08Z

Proof of concept for a late peephole optimization.

Take advantage of the fact that the emitter tracks the last emitted instruction and that IGs do not span basic blocks to do a simple peephole optimization that can suppress redundant (mov rax, rdx; mov rax rdx) and shuffle (mov rax, rdx; mov rdx, rax) moves.

In combination these two address simple the cases of shuffle move chains like

mov rax, rdx
mov rdx, rax      ==>     mov rax, rdx
mov rax, rdx

The second mov is supressed as shuffle and so the third move becomes redundant.

@dotnet/jit-contrib curious to hear your thoughts on this approach. It would serve as a complementary piece and is not intended as the primary solution to various CQ issues. And it could give us a template for doing other simple-minded peephole opts.

We could also look back further (with some extra work) to catch the two-register variant which we sometimes see from promoted Span<T> copies.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -20032 (-0.05% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -4359 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.08% of base)
       -2809 : System.Private.CoreLib.dasm (-0.07% of base)
       -2497 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1474 : System.Private.Xml.dasm (-0.04% of base)
        -834 : System.Memory.dasm (-0.51% of base)
90 total files with size differences (90 improved, 0 regressed), 39 unchanged.
Top method improvements by size (bytes):
        -135 (-2.61% of base) : System.IO.FileSystem.dasm - FileSystemEnumerator`1:MoveNext():bool:this (5 methods)
         -90 (-2.29% of base) : System.Private.CoreLib.dasm - TextInfo:ChangeCaseCommon(ref):ref:this (6 methods)
         -90 (-4.27% of base) : System.Private.CoreLib.dasm - EventPipePayloadDecoder:DecodePayload(byref,struct):ref
         -73 (-1.25% of base) : System.Security.Cryptography.Algorithms.dasm - KeyFormatHelper:ReadEncryptedPkcs8(ref,struct,struct,struct,ref,byref,byref) (5 methods)
         -66 (-2.12% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Scanner:ScanXmlString(ushort,ushort,bool):ref:this
Top method improvements by size (percentage):
          -6 (-15.79% of base) : Microsoft.CodeAnalysis.dasm - ModuleMetadata:CreateFromFile(ref):ref
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - CustomEventAccessorSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetReturnTypeAttributeDeclarations():struct:this
         -22 (-11.40% of base) : System.Private.Xml.dasm - NameTable:ComputeHash32(ref,int,int):int
3138 total methods with size differences (3138 improved, 0 regressed), 190301 unchanged.

Sample cherry-picked diff:

;; System.Private.Xml      
;; NameTable:ComputeHash32(ref,int,int):int

 G_M17318_IG06:
        mov      rcx, rax
-       mov      rax, rcx
        imul     ecx, r9d, 2
        jo       SHORT G_M17318_IG08
        mov      rdx, rax               // this is also now dead....
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
        mov      rsi, rax
        mov      edi, ecx
        mov      rcx, 0xD1FFAB1E
        mov      edx, 100
        call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE

benaadams · 2019-01-11T18:01:57Z

We could also look back further (with some extra work) to catch the two-register variant which we sometimes see from promoted Span<T> copies.

Was just looking at this in Kestrel HttpParser`1:ParseRequestLine(struct,byref,byref,byref):bool:this

G_M24661_IG39:
       488BCB               mov      rcx, rbx
       488BD9               mov      rbx, rcx
       418BF5               mov      esi, r13d
       488BCB               mov      rcx, rbx
       448BEE               mov      r13d, esi
       488BD9               mov      rbx, rcx
       418BF5               mov      esi, r13d
       EB21                 jmp      SHORT G_M24661_IG43

That kinda thing?

AndyAyersMS · 2019-01-11T18:21:18Z

Yes, exactly that.

4creators · 2019-01-11T18:56:55Z

Any results from benchmarks?

CarolEidt · 2019-01-11T22:20:20Z

src/jit/codegenxarch.cpp

+                    // Sure would be nice to just provisionally emit the new instruction and compare
+                    // instrDescs, but that's not possible just yet.
+                    //
+                    // Do we need any other safety checks here? Maybe same gcness?


It's been a while since I dealt with the gc reporting, but I think it's the case that the emitter does the final reporting, so I don't think there'd be an impact because we haven't actually emitted this instruction yet, and the gcness of the registers isn't changing.

CarolEidt · 2019-01-11T22:21:29Z

This is interesting, and although it's a bit messy, it's probably pretty cheap.

benaadams · 2019-01-11T23:40:03Z

Tizen is some artifact copy issue:

First time build. Skipping changelog.
Run condition [Current build status] enabling prebuild for step [[Archive the artifacts]]
Copied 1 artifact from "dotnet_coreclr » master » checked_windows_nt_bld" build number 38414
ERROR: Unable to find project for artifact copy: dotnet_corefx/master/tizen_armel_cross_release
This may be due to incorrect project name or permission settings; see help for project name in job configuration.
[Current build status] check if current [FAILURE] is worse or equals then [SUCCESS] and better or equals then [SUCCESS]
Run condition [Current build status] preventing perform for step [[Archive the artifacts]]
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of fcb12fff72cdf317b99169339035a416589ba251 to FAILURE with url https://ci.dot.net/job/dotnet_coreclr/job/master/job/armel_cross_checked_tizen_innerloop_prtest/7438/ and message: 'Build finished. '
Using context: Tizen armel Cross Checked Innerloop Build and Test
[WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
Finished: FAILURE

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

AndyAyersMS · 2019-01-12T01:29:27Z

Updated to look back two instructions (when safe to do so).

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -22293 (-0.06% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -4509 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.08% of base)
       -3387 : System.Private.CoreLib.dasm (-0.09% of base)
       -2552 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1519 : System.Private.Xml.dasm (-0.04% of base)
       -1230 : System.Memory.dasm (-0.76% of base)
90 total files with size differences (90 improved, 0 regressed), 39 unchanged.
Top method improvements by size (bytes):
        -135 (-2.61% of base) : System.IO.FileSystem.dasm - FileSystemEnumerator`1:MoveNext():bool:this (5 methods)
        -112 (-7.61% of base) : System.Private.CoreLib.dasm - Memory`1:CopyTo(struct):this (5 methods)
        -112 (-7.47% of base) : System.Private.CoreLib.dasm - ReadOnlyMemory`1:CopyTo(struct):this (5 methods)
         -99 (-4.70% of base) : System.Private.CoreLib.dasm - EventPipePayloadDecoder:DecodePayload(byref,struct):ref
         -90 (-2.29% of base) : System.Private.CoreLib.dasm - TextInfo:ChangeCaseCommon(ref):ref:this (6 methods)
Top method improvements by size (percentage):
          -6 (-15.79% of base) : Microsoft.CodeAnalysis.dasm - ModuleMetadata:CreateFromFile(ref):ref
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - CustomEventAccessorSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetReturnTypeAttributeDeclarations():struct:this
         -22 (-11.40% of base) : System.Private.Xml.dasm - NameTable:ComputeHash32(ref,int,int):int
3166 total methods with size differences (3166 improved, 0 regressed), 190273 unchanged.

Another cherry-picked diff showing a double shuffle getting zapped:

 ;; System.Memory
 ;; BuffersExtensions:CopyTo(byref,struct)
 G_M61410_IG18:
        mov      r8, r14
-       mov      r14, r8
        mov      r8d, r15d
        mov      rdx, r14
-       mov      r15d, r8d
-       mov      r14, rdx
-       mov      r8d, r15d
        cmp      r8d, ebx
        ja       SHORT G_M61410_IG26
        mov      rdx, r14
        movsxd   r8, r8d
        shl      r8, 5
        mov      rcx, rdi
        call     Buffer:Memmove(byref,byref,long)

AndyAyersMS · 2019-01-12T01:36:01Z

Thinking I may move the instrDesc specific checks over to the emitter. This would let us generalize to a broader set of forms, in particular for the interferferesWith style checks.

fiigii · 2019-01-12T01:56:28Z

Perhaps we should JitDump the skipped instructions so that we could root cause the redundancy if needed in the future.

AndyAyersMS · 2019-01-12T02:21:42Z

OSX failed during build, unable to write an exectuable:

17:39:52 [ 60%] Linking CXX executable paltest_swscanf_test14
17:39:53 ld: can't write to output file: paltest_swscanf_test14, errno=28 for architecture x86_64
17:39:53 clang: error: linker command failed with exit code 1 (use -v to see invocation)
17:39:53 [ 60%] Built target paltest_swscanf_test12
17:39:53 make[2]: *** [src/pal/tests/palsuite/c_runtime/swscanf/test14/paltest_swscanf_test14] Error 1

AndyAyersMS · 2019-01-12T02:51:39Z

There is no actual instruction (instrDesc) to dump, though we could fake one up or fairly simply describe what it would look like.

Things seem fairly clear from the current jit dump, at least to me ...

Generating: N089 (  7,  6) [000158] DA--G-------              *  STORE_LCL_VAR byref  V09 tmp8         d:1 rcx REG rcx
							Byref regs: 00000001 {rax} => 00000000 {}
IN0010:        mov      rcx, rax
							V09 in reg rcx is becoming live  [000158]
							Live regs: 00000004 {rdx} => 00000006 {rcx rdx}
							Live vars: {V04} => {V04 V09}
							Byref regs: 00000000 {} => 00000002 {rcx}
genIPmappingAdd: ignoring duplicate IL offset 0x0
Generating: N091 (  7,  5) [000116] ------------                 IL_OFFSET void   IL offset: 0x0 REG NA
Generating: N093 (  3,  2) [000156] ------------       t156 =    LCL_VAR   byref  V09 tmp8         u:1 rcx (last use) REG rcx <l:$103, c:$102>
                                                             /--*  t156   byref  
Generating: N095 (  7,  5) [000114] DA----------              *  STORE_LCL_VAR byref  V14 tmp13        d:1 rax REG rax
							V09 in reg rcx is becoming dead  [000156]
							Live regs: 00000006 {rcx rdx} => 00000004 {rdx}
							Live vars: {V04 V09} => {V04}
							Byref regs: 00000002 {rcx} => 00000000 {}

-- skipping emission -- previous instr IN0010 makes this redundant
							V14 in reg rax is becoming live  [000114]
							Live regs: 00000004 {rdx} => 00000005 {rax rdx}
							Live vars: {V04} => {V04 V14}
							Byref regs: 00000000 {} => 00000001 {rax}

The liveness updates for suppressed instructions seem a bit scary, but are necessary.

For example, in the case above we have a byref in rax and we track gc liveness changes as follows, where "gc live" is the set of gc live regs after the instruction is executed, and the `"..." shows transient GC live states that get overwritten. Because the peepholes are done independently we need to show that we can do any subset of them and still end up with correctly described GC states. And this only works out if liveness updates for those instructions are processed even though the instructions themselves are dropped.

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       rcx
   mov rax, rcx    -rcx +rax       rax
   mov rcx, rax    -rax +rcx       rcx

with the full peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       ... (rcx)
                   -rcx +rax       ... (rax)
                   -rax +rcx       rcx

with just the first instruction peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       ... (rcx)
                   -rcx +rax       rax
   mov rcx, rax    -rax +rcx       rcx

with just the second instruction peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       rcx
   mov rax, rcx    -rcx +rax       ... (rax)
                   -rax +rcx       rcx

benaadams · 2019-01-13T22:06:10Z

Possibly something that could benefit from a similar optimization? I'm seeing this in async .MoveNext() methods:

       488B4D10             mov      rcx, bword ptr [rbp+10H]  ; load (rcx) [rbp+10H]
       895104               mov      dword ptr [rcx+4], edx
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       837A0464             cmp      dword ptr [rdx+4], 100
       0F8C01FFFFFF         jl       G_M10160_IG04

G_M10160_IG09:
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       C702FEFFFFFF         mov      dword ptr [rdx], -2
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       3912                 cmp      dword ptr [rdx], edx
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]

benaadams · 2019-01-13T23:48:30Z

Hmm... maybe not, might be volatile?

benaadams · 2019-01-14T00:20:31Z

Raised issue https://github.com/dotnet/coreclr/issues/21973

danmoseley · 2019-01-14T02:10:09Z

Will this potentially conceal future regressions in code quality in earlier stages?

benaadams · 2019-01-14T18:02:19Z

src/jit/codegenxarch.cpp

+//
+// We could also think about handling reg-mem moves in a similar manner,
+// especially when mem is a stack access. This could give us writethrough like
+// codegen for EH methods and (perhaps, if we dared) simpler code for minopts


I think reg-mem moves would also improve all async methods? https://github.com/dotnet/coreclr/issues/21973

briansull · 2019-01-14T18:39:01Z

src/jit/codegenxarch.cpp

+// eg    mov [rbp + 40], rax
+//       mov rax, [rbp + 40]  -- can omit the "reload"
+//
+bool CodeGen::isRedundantMove(


We should also implement this for ARM64 and ARM32

benaadams · 2019-01-15T00:50:49Z

Could the peephole also forward? e.g. (from https://github.com/dotnet/coreclr/issues/21973#issuecomment-454217381)

       mov      dword ptr [rsi+4], edx
       cmp      dword ptr [rsi+4], 100

to

       mov      dword ptr [rsi+4], edx
       cmp      edx, 100

AndyAyersMS · 2019-01-15T01:28:38Z

Something like this is probably better done earlier when we have a better model for memory, but yeah we might be able to find cases where we can change instructions like this.

redknightlois · 2019-01-29T17:44:01Z

@AndyAyersMS Could this method be generalized to lookback way more than 2 if AggressiveOptimization flag is on for a certain method? When trying to optimize SequenceCompareTo which is a very core method these redundant chains kill performance.

AndyAyersMS · 2019-01-29T17:58:53Z

I would hope #19429 would get rid of most cases of these shuffle chains.

AndyAyersMS · 2019-02-07T16:11:19Z

Need to re-evaluate now that #19429 is in.

Also, should follow the guidance that led to revision of #22454 and not expose emitter details to the code generator.

AndyAyersMS · 2019-02-13T16:47:46Z

Looks like the recent preferencing changes got about 90% of the cases above.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -1135 (-0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -739 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01% of base)
        -128 : System.Private.Xml.dasm (-0.00% of base)
         -71 : System.Private.CoreLib.dasm (-0.00% of base)
         -24 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
         -18 : Newtonsoft.Json.dasm (-0.00% of base)
30 total files with size differences (30 improved, 0 regressed), 99 unchanged.
Top method improvements by size (bytes):
         -27 (-1.03% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseExitStatement():ref:this
         -15 (-0.53% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseInterpolatedStringInterpolation():ref:this
         -12 (-1.09% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseContinueStatement():ref:this
         -12 (-0.63% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseCaseStatement():ref:this
         -12 (-1.07% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseXmlElementStartTag(int):ref:this
Top method improvements by size (percentage):
          -3 (-2.94% of base) : System.ComponentModel.TypeConverter.dasm - InterlockedBitVector32:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - EventToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - MethodToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - ParameterToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - PropertyToken:Equals(ref):bool:this
292 total methods with size differences (292 improved, 0 regressed), 193009 unchanged.

I'm going to shelve this for now.

AndyAyersMS added 2 commits January 10, 2019 13:28

proof of concept

9d5ef11

ensure IG non-empty

fcb12ff

CarolEidt reviewed Jan 11, 2019

View reviewed changes

two-instruction lookback

970b82b

benaadams reviewed Jan 14, 2019

View reviewed changes

briansull reviewed Jan 14, 2019

View reviewed changes

benaadams mentioned this pull request Jan 15, 2019

Propagate preferences #19429

Merged

AndyAyersMS closed this Feb 13, 2019

AndyAyersMS mentioned this pull request Mar 15, 2021

Redundant mov instructions generated dotnet/runtime#49535

Closed

AndyAyersMS mentioned this pull request Apr 28, 2021

PMI diffs are failing due to assert dotnet/runtime#51728

Closed

AndyAyersMS mentioned this pull request Jun 12, 2021

Port IsRedundantMov to xarch dotnet/runtime#54075

Merged

AndyAyersMS mentioned this pull request Oct 9, 2021

XARCH: Eliminate redundant mov to/from mem. dotnet/runtime#59780

Merged

Conversation

AndyAyersMS commented Jan 11, 2019

Uh oh!

benaadams commented Jan 11, 2019

Uh oh!

AndyAyersMS commented Jan 11, 2019

Uh oh!

4creators commented Jan 11, 2019

Uh oh!

CarolEidt Jan 11, 2019

Choose a reason for hiding this comment

Uh oh!

CarolEidt commented Jan 11, 2019

Uh oh!

benaadams commented Jan 11, 2019

Uh oh!

AndyAyersMS commented Jan 12, 2019

Uh oh!

AndyAyersMS commented Jan 12, 2019

Uh oh!

fiigii commented Jan 12, 2019

Uh oh!

AndyAyersMS commented Jan 12, 2019

Uh oh!

AndyAyersMS commented Jan 12, 2019

Uh oh!

benaadams commented Jan 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Jan 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Jan 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danmoseley commented Jan 14, 2019

Uh oh!

benaadams Jan 14, 2019

Choose a reason for hiding this comment

Uh oh!

briansull Jan 14, 2019

Choose a reason for hiding this comment

Uh oh!

benaadams commented Jan 15, 2019

Uh oh!

AndyAyersMS commented Jan 15, 2019

Uh oh!

redknightlois commented Jan 29, 2019

Uh oh!

AndyAyersMS commented Jan 29, 2019

Uh oh!

AndyAyersMS commented Feb 7, 2019

Uh oh!

AndyAyersMS commented Feb 13, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

benaadams commented Jan 13, 2019 •

edited

Loading

benaadams commented Jan 13, 2019 •

edited

Loading

benaadams commented Jan 14, 2019 •

edited

Loading