Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

WIP: Peephole optimize redundant and shuffle moves#21959

Closed
AndyAyersMS wants to merge 3 commits intodotnet:masterfrom
AndyAyersMS:PeepholeRedundantMoves
Closed

WIP: Peephole optimize redundant and shuffle moves#21959
AndyAyersMS wants to merge 3 commits intodotnet:masterfrom
AndyAyersMS:PeepholeRedundantMoves

Conversation

@AndyAyersMS
Copy link
Member

Proof of concept for a late peephole optimization.

Take advantage of the fact that the emitter tracks the last emitted instruction and that IGs do not span basic blocks to do a simple peephole optimization that can suppress redundant (mov rax, rdx; mov rax rdx) and shuffle (mov rax, rdx; mov rdx, rax) moves.

In combination these two address simple the cases of shuffle move chains like

mov rax, rdx
mov rdx, rax      ==>     mov rax, rdx
mov rax, rdx

The second mov is supressed as shuffle and so the third move becomes redundant.

@dotnet/jit-contrib curious to hear your thoughts on this approach. It would serve as a complementary piece and is not intended as the primary solution to various CQ issues. And it could give us a template for doing other simple-minded peephole opts.

We could also look back further (with some extra work) to catch the two-register variant which we sometimes see from promoted Span<T> copies.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -20032 (-0.05% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -4359 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.08% of base)
       -2809 : System.Private.CoreLib.dasm (-0.07% of base)
       -2497 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1474 : System.Private.Xml.dasm (-0.04% of base)
        -834 : System.Memory.dasm (-0.51% of base)
90 total files with size differences (90 improved, 0 regressed), 39 unchanged.
Top method improvements by size (bytes):
        -135 (-2.61% of base) : System.IO.FileSystem.dasm - FileSystemEnumerator`1:MoveNext():bool:this (5 methods)
         -90 (-2.29% of base) : System.Private.CoreLib.dasm - TextInfo:ChangeCaseCommon(ref):ref:this (6 methods)
         -90 (-4.27% of base) : System.Private.CoreLib.dasm - EventPipePayloadDecoder:DecodePayload(byref,struct):ref
         -73 (-1.25% of base) : System.Security.Cryptography.Algorithms.dasm - KeyFormatHelper:ReadEncryptedPkcs8(ref,struct,struct,struct,ref,byref,byref) (5 methods)
         -66 (-2.12% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Scanner:ScanXmlString(ushort,ushort,bool):ref:this
Top method improvements by size (percentage):
          -6 (-15.79% of base) : Microsoft.CodeAnalysis.dasm - ModuleMetadata:CreateFromFile(ref):ref
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - CustomEventAccessorSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetReturnTypeAttributeDeclarations():struct:this
         -22 (-11.40% of base) : System.Private.Xml.dasm - NameTable:ComputeHash32(ref,int,int):int
3138 total methods with size differences (3138 improved, 0 regressed), 190301 unchanged.

Sample cherry-picked diff:

;; System.Private.Xml      
;; NameTable:ComputeHash32(ref,int,int):int

 G_M17318_IG06:
        mov      rcx, rax
-       mov      rax, rcx
        imul     ecx, r9d, 2
        jo       SHORT G_M17318_IG08
        mov      rdx, rax               // this is also now dead....
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
        mov      rsi, rax
        mov      edi, ecx
        mov      rcx, 0xD1FFAB1E
        mov      edx, 100
        call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE

@benaadams
Copy link
Member

We could also look back further (with some extra work) to catch the two-register variant which we sometimes see from promoted Span<T> copies.

Was just looking at this in Kestrel HttpParser`1:ParseRequestLine(struct,byref,byref,byref):bool:this

G_M24661_IG39:
       488BCB               mov      rcx, rbx
       488BD9               mov      rbx, rcx
       418BF5               mov      esi, r13d
       488BCB               mov      rcx, rbx
       448BEE               mov      r13d, esi
       488BD9               mov      rbx, rcx
       418BF5               mov      esi, r13d
       EB21                 jmp      SHORT G_M24661_IG43

That kinda thing?

@AndyAyersMS
Copy link
Member Author

Yes, exactly that.

@4creators
Copy link

Any results from benchmarks?

// Sure would be nice to just provisionally emit the new instruction and compare
// instrDescs, but that's not possible just yet.
//
// Do we need any other safety checks here? Maybe same gcness?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while since I dealt with the gc reporting, but I think it's the case that the emitter does the final reporting, so I don't think there'd be an impact because we haven't actually emitted this instruction yet, and the gcness of the registers isn't changing.

@CarolEidt
Copy link

This is interesting, and although it's a bit messy, it's probably pretty cheap.

@benaadams
Copy link
Member

Tizen is some artifact copy issue:

First time build. Skipping changelog.
Run condition [Current build status] enabling prebuild for step [[Archive the artifacts]]
Copied 1 artifact from "dotnet_coreclr » master » checked_windows_nt_bld" build number 38414
ERROR: Unable to find project for artifact copy: dotnet_corefx/master/tizen_armel_cross_release
This may be due to incorrect project name or permission settings; see help for project name in job configuration.
[Current build status] check if current [FAILURE] is worse or equals then [SUCCESS] and better or equals then [SUCCESS]
Run condition [Current build status] preventing perform for step [[Archive the artifacts]]
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of fcb12fff72cdf317b99169339035a416589ba251 to FAILURE with url https://ci.dot.net/job/dotnet_coreclr/job/master/job/armel_cross_checked_tizen_innerloop_prtest/7438/ and message: 'Build finished. '
Using context: Tizen armel Cross Checked Innerloop Build and Test
[WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
Finished: FAILURE

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@AndyAyersMS
Copy link
Member Author

Updated to look back two instructions (when safe to do so).

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -22293 (-0.06% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -4509 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.08% of base)
       -3387 : System.Private.CoreLib.dasm (-0.09% of base)
       -2552 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1519 : System.Private.Xml.dasm (-0.04% of base)
       -1230 : System.Memory.dasm (-0.76% of base)
90 total files with size differences (90 improved, 0 regressed), 39 unchanged.
Top method improvements by size (bytes):
        -135 (-2.61% of base) : System.IO.FileSystem.dasm - FileSystemEnumerator`1:MoveNext():bool:this (5 methods)
        -112 (-7.61% of base) : System.Private.CoreLib.dasm - Memory`1:CopyTo(struct):this (5 methods)
        -112 (-7.47% of base) : System.Private.CoreLib.dasm - ReadOnlyMemory`1:CopyTo(struct):this (5 methods)
         -99 (-4.70% of base) : System.Private.CoreLib.dasm - EventPipePayloadDecoder:DecodePayload(byref,struct):ref
         -90 (-2.29% of base) : System.Private.CoreLib.dasm - TextInfo:ChangeCaseCommon(ref):ref:this (6 methods)
Top method improvements by size (percentage):
          -6 (-15.79% of base) : Microsoft.CodeAnalysis.dasm - ModuleMetadata:CreateFromFile(ref):ref
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - CustomEventAccessorSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetReturnTypeAttributeDeclarations():struct:this
         -22 (-11.40% of base) : System.Private.Xml.dasm - NameTable:ComputeHash32(ref,int,int):int
3166 total methods with size differences (3166 improved, 0 regressed), 190273 unchanged.

Another cherry-picked diff showing a double shuffle getting zapped:

 ;; System.Memory
 ;; BuffersExtensions:CopyTo(byref,struct)
 G_M61410_IG18:
        mov      r8, r14
-       mov      r14, r8
        mov      r8d, r15d
        mov      rdx, r14
-       mov      r15d, r8d
-       mov      r14, rdx
-       mov      r8d, r15d
        cmp      r8d, ebx
        ja       SHORT G_M61410_IG26
        mov      rdx, r14
        movsxd   r8, r8d
        shl      r8, 5
        mov      rcx, rdi
        call     Buffer:Memmove(byref,byref,long)

@AndyAyersMS
Copy link
Member Author

Thinking I may move the instrDesc specific checks over to the emitter. This would let us generalize to a broader set of forms, in particular for the interferferesWith style checks.

@fiigii
Copy link

fiigii commented Jan 12, 2019

Perhaps we should JitDump the skipped instructions so that we could root cause the redundancy if needed in the future.

@AndyAyersMS
Copy link
Member Author

OSX failed during build, unable to write an exectuable:

17:39:52 [ 60%] Linking CXX executable paltest_swscanf_test14
17:39:53 ld: can't write to output file: paltest_swscanf_test14, errno=28 for architecture x86_64
17:39:53 clang: error: linker command failed with exit code 1 (use -v to see invocation)
17:39:53 [ 60%] Built target paltest_swscanf_test12
17:39:53 make[2]: *** [src/pal/tests/palsuite/c_runtime/swscanf/test14/paltest_swscanf_test14] Error 1

@AndyAyersMS
Copy link
Member Author

There is no actual instruction (instrDesc) to dump, though we could fake one up or fairly simply describe what it would look like.

Things seem fairly clear from the current jit dump, at least to me ...

Generating: N089 (  7,  6) [000158] DA--G-------              *  STORE_LCL_VAR byref  V09 tmp8         d:1 rcx REG rcx
							Byref regs: 00000001 {rax} => 00000000 {}
IN0010:        mov      rcx, rax
							V09 in reg rcx is becoming live  [000158]
							Live regs: 00000004 {rdx} => 00000006 {rcx rdx}
							Live vars: {V04} => {V04 V09}
							Byref regs: 00000000 {} => 00000002 {rcx}
genIPmappingAdd: ignoring duplicate IL offset 0x0
Generating: N091 (  7,  5) [000116] ------------                 IL_OFFSET void   IL offset: 0x0 REG NA
Generating: N093 (  3,  2) [000156] ------------       t156 =    LCL_VAR   byref  V09 tmp8         u:1 rcx (last use) REG rcx <l:$103, c:$102>
                                                             /--*  t156   byref  
Generating: N095 (  7,  5) [000114] DA----------              *  STORE_LCL_VAR byref  V14 tmp13        d:1 rax REG rax
							V09 in reg rcx is becoming dead  [000156]
							Live regs: 00000006 {rcx rdx} => 00000004 {rdx}
							Live vars: {V04 V09} => {V04}
							Byref regs: 00000002 {rcx} => 00000000 {}

-- skipping emission -- previous instr IN0010 makes this redundant
							V14 in reg rax is becoming live  [000114]
							Live regs: 00000004 {rdx} => 00000005 {rax rdx}
							Live vars: {V04} => {V04 V14}
							Byref regs: 00000000 {} => 00000001 {rax}

The liveness updates for suppressed instructions seem a bit scary, but are necessary.

For example, in the case above we have a byref in rax and we track gc liveness changes as follows, where "gc live" is the set of gc live regs after the instruction is executed, and the `"..." shows transient GC live states that get overwritten. Because the peepholes are done independently we need to show that we can do any subset of them and still end up with correctly described GC states. And this only works out if liveness updates for those instructions are processed even though the instructions themselves are dropped.

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       rcx
   mov rax, rcx    -rcx +rax       rax
   mov rcx, rax    -rax +rcx       rcx

with the full peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       ... (rcx)
                   -rcx +rax       ... (rax)
                   -rax +rcx       rcx

with just the first instruction peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       ... (rcx)
                   -rcx +rax       rax
   mov rcx, rax    -rax +rcx       rcx

with just the second instruction peephole

   instr           gc change       gc live
   -----           ---------       -------
   mov rcx, rax    -rax +rcx       rcx
   mov rax, rcx    -rcx +rax       ... (rax)
                   -rax +rcx       rcx

@benaadams
Copy link
Member

benaadams commented Jan 13, 2019

Possibly something that could benefit from a similar optimization? I'm seeing this in async .MoveNext() methods:

       488B4D10             mov      rcx, bword ptr [rbp+10H]  ; load (rcx) [rbp+10H]
       895104               mov      dword ptr [rcx+4], edx
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       837A0464             cmp      dword ptr [rdx+4], 100
       0F8C01FFFFFF         jl       G_M10160_IG04

G_M10160_IG09:
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       C702FEFFFFFF         mov      dword ptr [rdx], -2
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]
       3912                 cmp      dword ptr [rdx], edx
       488B5510             mov      rdx, bword ptr [rbp+10H]  ; load (rdx) [rbp+10H]

@benaadams
Copy link
Member

benaadams commented Jan 13, 2019

Hmm... maybe not, might be volatile?

@benaadams
Copy link
Member

benaadams commented Jan 14, 2019

@danmoseley
Copy link
Member

Will this potentially conceal future regressions in code quality in earlier stages?

//
// We could also think about handling reg-mem moves in a similar manner,
// especially when mem is a stack access. This could give us writethrough like
// codegen for EH methods and (perhaps, if we dared) simpler code for minopts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reg-mem moves would also improve all async methods? https://github.com/dotnet/coreclr/issues/21973

// eg mov [rbp + 40], rax
// mov rax, [rbp + 40] -- can omit the "reload"
//
bool CodeGen::isRedundantMove(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also implement this for ARM64 and ARM32

@benaadams
Copy link
Member

Could the peephole also forward? e.g. (from https://github.com/dotnet/coreclr/issues/21973#issuecomment-454217381)

       mov      dword ptr [rsi+4], edx
       cmp      dword ptr [rsi+4], 100

to

       mov      dword ptr [rsi+4], edx
       cmp      edx, 100

@AndyAyersMS
Copy link
Member Author

Something like this is probably better done earlier when we have a better model for memory, but yeah we might be able to find cases where we can change instructions like this.

@redknightlois
Copy link

@AndyAyersMS Could this method be generalized to lookback way more than 2 if AggressiveOptimization flag is on for a certain method? When trying to optimize SequenceCompareTo which is a very core method these redundant chains kill performance.

@AndyAyersMS
Copy link
Member Author

I would hope #19429 would get rid of most cases of these shuffle chains.

@AndyAyersMS
Copy link
Member Author

Need to re-evaluate now that #19429 is in.

Also, should follow the guidance that led to revision of #22454 and not expose emitter details to the code generator.

@AndyAyersMS
Copy link
Member Author

Looks like the recent preferencing changes got about 90% of the cases above.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -1135 (-0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -739 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01% of base)
        -128 : System.Private.Xml.dasm (-0.00% of base)
         -71 : System.Private.CoreLib.dasm (-0.00% of base)
         -24 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
         -18 : Newtonsoft.Json.dasm (-0.00% of base)
30 total files with size differences (30 improved, 0 regressed), 99 unchanged.
Top method improvements by size (bytes):
         -27 (-1.03% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseExitStatement():ref:this
         -15 (-0.53% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseInterpolatedStringInterpolation():ref:this
         -12 (-1.09% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseContinueStatement():ref:this
         -12 (-0.63% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseCaseStatement():ref:this
         -12 (-1.07% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseXmlElementStartTag(int):ref:this
Top method improvements by size (percentage):
          -3 (-2.94% of base) : System.ComponentModel.TypeConverter.dasm - InterlockedBitVector32:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - EventToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - MethodToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - ParameterToken:Equals(ref):bool:this
          -3 (-2.94% of base) : System.Private.CoreLib.dasm - PropertyToken:Equals(ref):bool:this
292 total methods with size differences (292 improved, 0 regressed), 193009 unchanged.

I'm going to shelve this for now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants