WIP: Peephole optimize redundant and shuffle moves#21959
AndyAyersMS wants to merge 3 commits into dotnet:master from …
Conversation
Was just looking at this in Kestrel HttpParser`1:ParseRequestLine(struct,byref,byref,byref):bool:this, G_M24661_IG39:
488BCB mov rcx, rbx
488BD9 mov rbx, rcx
418BF5 mov esi, r13d
488BCB mov rcx, rbx
448BEE mov r13d, esi
488BD9 mov rbx, rcx
418BF5 mov esi, r13d
EB21 jmp SHORT G_M24661_IG43

That kinda thing?
Yes, exactly that.
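The exchange above boils down to two register checks. Here is a minimal hypothetical sketch — the `Move` struct and function names are invented for illustration and are not the actual jit code:

```cpp
#include <cassert>

// A move is redundant when it repeats the previous move exactly, and it is a
// "shuffle" when it just copies the value back where it came from. Either way
// the destination already holds the right value, so the second mov can be
// suppressed.
struct Move
{
    int dst; // destination register number
    int src; // source register number
};

// mov rax, rdx ; mov rax, rdx -- second mov repeats the first
bool isRedundantMove(const Move& prev, const Move& cur)
{
    return cur.dst == prev.dst && cur.src == prev.src;
}

// mov rax, rdx ; mov rdx, rax -- second mov writes back an unchanged value
bool isShuffleMove(const Move& prev, const Move& cur)
{
    return cur.dst == prev.src && cur.src == prev.dst;
}

bool canSuppress(const Move& prev, const Move& cur)
{
    return isRedundantMove(prev, cur) || isShuffleMove(prev, cur);
}
```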
Any results from benchmarks?
src/jit/codegenxarch.cpp (outdated):

// Sure would be nice to just provisionally emit the new instruction and compare
// instrDescs, but that's not possible just yet.
//
// Do we need any other safety checks here? Maybe same gcness?
It's been a while since I dealt with the gc reporting, but I think it's the case that the emitter does the final reporting, so I don't think there'd be an impact because we haven't actually emitted this instruction yet, and the gcness of the registers isn't changing.
This is interesting, and although it's a bit messy, it's probably pretty cheap.
The Tizen failure is some artifact copy issue: @dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test
Updated to look back two instructions (when safe to do so). Another cherry-picked diff showing a double shuffle getting zapped:

;; System.Memory
;; BuffersExtensions:CopyTo(byref,struct)
G_M61410_IG18:
mov r8, r14
- mov r14, r8
mov r8d, r15d
mov rdx, r14
- mov r15d, r8d
- mov r14, rdx
- mov r8d, r15d
cmp r8d, ebx
ja SHORT G_M61410_IG26
mov rdx, r14
movsxd r8, r8d
shl r8, 5
mov rcx, rdi
call Buffer:Memmove(byref,byref,long)
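The two-instruction lookback can be sketched the same way; again this is a hypothetical illustration, not the emitter code — the struct and names are invented:

```cpp
#include <cassert>

// A shuffle pair can still be suppressed with one instruction in between,
// provided the intervening move writes neither register involved in the
// shuffle (otherwise the value the final mov "restores" may be stale).
struct Move
{
    int dst; // destination register number
    int src; // source register number
};

bool isSafeLookback2Shuffle(const Move& twoBack, const Move& between, const Move& cur)
{
    // cur writes back the value twoBack copied out: a shuffle at distance 2.
    bool shufflePair = (cur.dst == twoBack.src) && (cur.src == twoBack.dst);

    // The move in between must not clobber either register of the pair.
    bool noInterference = (between.dst != cur.dst) && (between.dst != cur.src);

    return shufflePair && noInterference;
}
```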
Thinking I may move the …
Perhaps we should JitDump the skipped instructions so that we could root-cause the redundancy if needed in the future.
OSX failed during build, unable to write an executable:
There is no actual instruction (…).

Things seem fairly clear from the current jit dump, at least to me ...

The liveness updates for suppressed instructions seem a bit scary, but are necessary. For example, in the case above we have a byref in …:

with the full peephole: …
with just the first instruction peephole: …
with just the second instruction peephole: …
Possibly something that could benefit from a similar optimization? I'm seeing this in async …:

488B4D10 mov rcx, bword ptr [rbp+10H] ; load (rcx) [rbp+10H]
895104 mov dword ptr [rcx+4], edx
488B5510 mov rdx, bword ptr [rbp+10H] ; load (rdx) [rbp+10H]
837A0464 cmp dword ptr [rdx+4], 100
0F8C01FFFFFF jl G_M10160_IG04
G_M10160_IG09:
488B5510 mov rdx, bword ptr [rbp+10H] ; load (rdx) [rbp+10H]
C702FEFFFFFF mov dword ptr [rdx], -2
488B5510 mov rdx, bword ptr [rbp+10H] ; load (rdx) [rbp+10H]
3912 cmp dword ptr [rdx], edx
488B5510 mov rdx, bword ptr [rbp+10H] ; load (rdx) [rbp+10H]
Hmm... maybe not, might be volatile?
Raised issue https://github.com/dotnet/coreclr/issues/21973
Will this potentially conceal future regressions in code quality in earlier stages?
//
// We could also think about handling reg-mem moves in a similar manner,
// especially when mem is a stack access. This could give us writethrough-like
// codegen for EH methods and (perhaps, if we dared) simpler code for minopts
I think reg-mem moves would also improve all async methods? https://github.com/dotnet/coreclr/issues/21973
// eg  mov [rbp + 40], rax
//     mov rax, [rbp + 40]  -- can omit the "reload"
//
bool CodeGen::isRedundantMove(
We should also implement this for ARM64 and ARM32
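The reg-mem reload case quoted above could be sketched like this — a hypothetical illustration with invented names, covering only the simplest store-then-reload of the same register and stack slot; real code would also have to rule out volatile accesses and intervening writes:

```cpp
#include <cassert>

// mov [rbp + 40], rax
// mov rax, [rbp + 40]  -- the reload is redundant: rax already holds the value
struct RegMemMove
{
    bool isStore;     // true: mov [frame + offset], reg; false: mov reg, [frame + offset]
    int  reg;         // register number
    int  frameOffset; // stack slot, e.g. rbp + 40
};

// The reload can be omitted only when the previous instruction stored the
// same register to the same stack slot.
bool isRedundantReload(const RegMemMove& prev, const RegMemMove& cur)
{
    return prev.isStore && !cur.isStore &&
           cur.reg == prev.reg && cur.frameOffset == prev.frameOffset;
}
```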
Could the peephole also forward? e.g. (from https://github.com/dotnet/coreclr/issues/21973#issuecomment-454217381)

mov dword ptr [rsi+4], edx
cmp dword ptr [rsi+4], 100

to

mov dword ptr [rsi+4], edx
cmp edx, 100
Something like this is probably better done earlier when we have a better model for memory, but yeah we might be able to find cases where we can change instructions like this.
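The forwarding rewrite could be sketched as below — hypothetical types and names for illustration only; a real implementation would need aliasing and volatility checks:

```cpp
#include <cassert>

// mov dword ptr [rsi+4], edx
// cmp dword ptr [rsi+4], 100   -->   cmp edx, 100
struct Store
{
    int reg;     // register just stored
    int address; // abstract key for the address, e.g. rsi+4
};

struct Compare
{
    bool fromMemory;   // true: cmp [address], imm; false: cmp reg, imm
    int  regOrAddress; // address if fromMemory, else register
    int  imm;
};

// If the compare reads the address the previous store wrote, rewrite it to
// compare against the stored register instead. Returns true when it applies.
bool tryForwardStore(const Store& st, Compare& cmp)
{
    if (!cmp.fromMemory || cmp.regOrAddress != st.address)
        return false;
    cmp.fromMemory   = false;
    cmp.regOrAddress = st.reg;
    return true;
}
```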
@AndyAyersMS Could this method be generalized to look back more than 2 if the AggressiveOptimization flag is on for a certain method? When trying to optimize …
I would hope #19429 would get rid of most cases of these shuffle chains.
Looks like the recent preferencing changes got about 90% of the cases above. I'm going to shelve this for now. |
Proof of concept for a late peephole optimization.
Take advantage of the fact that the emitter tracks the last emitted instruction and that IGs do not span basic blocks to do a simple peephole optimization that can suppress redundant (mov rax, rdx; mov rax, rdx) and shuffle (mov rax, rdx; mov rdx, rax) moves. In combination these two address the simple cases of shuffle move chains like …
The second mov is suppressed as a shuffle and so the third move becomes redundant.
@dotnet/jit-contrib curious to hear your thoughts on this approach. It would serve as a complementary piece and is not intended as the primary solution to various CQ issues. And it could give us a template for doing other simple-minded peephole opts.
We could also look back further (with some extra work) to catch the two-register variant which we sometimes see from promoted Span<T> copies.

Sample cherry-picked diff:
;; System.Private.Xml
;; NameTable:ComputeHash32(ref,int,int):int
G_M17318_IG06:
        mov      rcx, rax
-       mov      rax, rcx
        imul     ecx, r9d, 2
        jo       SHORT G_M17318_IG08
        mov      rdx, rax // this is also now dead....
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
        mov      rsi, rax
        mov      edi, ecx
        mov      rcx, 0xD1FFAB1E
        mov      edx, 100
        call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE