JIT: Avoid xchg for resolution #81216
Conversation
LSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than a few movs through a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures); xchg is still used on xarch if no temporary register is available. Additionally, this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch in this case. Finally, we now prefer a non-callee-saved temporary register so that we don't need to save/restore it in the prolog/epilog. This mostly fixes the string hashcode regressions from dotnet#80743.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
|
|
The typical diff looks like:

G_M14819_IG09:        ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=00000028 {rbx rbp}, byrefRegs=00000000 {}, gcvars, byref
       ; gcrRegs -[rax] +[rbp]
-      xchg     rbx, rbp
+      mov      rcx, rbx
+      ; gcrRegs +[rcx]
+      mov      rbx, rbp
+      mov      rbp, rcx
       jmp      G_M14819_IG03

-      xchg     rdx, rsi
-      xchg     rsi, r11
-      xchg     r9, r11
-      xchg     r10, r11
+      mov      ebx, edx
+      mov      edx, esi
+      mov      esi, r11d
+      mov      r11d, r10d
+      mov      r10d, r9d
+      mov      r9d, ebx

May need to look at TP, let's see. |
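The shape of the new codegen can be sketched with a hypothetical helper (for illustration only, not the actual LSRA implementation) that emits the mov sequence for a resolution cycle once a temporary register is available:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not the actual LSRA code): for a resolution cycle
// regs[0] <- regs[1] <- ... <- regs[n-1] <- regs[0], emit the mov sequence
// used once a temp register is available. Without a temp, the same cycle
// needs n-1 xchg instructions on xarch.
std::vector<std::string> resolveCycleWithTemp(const std::vector<std::string>& regs,
                                              const std::string& temp)
{
    std::vector<std::string> movs;
    movs.push_back("mov " + temp + ", " + regs[0]); // save the cycle's first value
    for (size_t i = 0; i + 1 < regs.size(); i++)
    {
        movs.push_back("mov " + regs[i] + ", " + regs[i + 1]); // shift along the cycle
    }
    movs.push_back("mov " + regs.back() + ", " + temp); // close the cycle from the temp
    return movs;
}
```

For the two-register cycle in the first diff ({rbx, rbp} with rcx as the temp) this yields exactly the three movs shown there.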
|
Example ARM64 diff:

-      str      x24, [fp, #0x38]  // [V06 loc3]
-      ; GC ptr vars +{V06}
+      mov      x0, x24
+      ; gcrRegs +[x0]
       mov      x24, x25
-      ldr      x25, [fp, #0x38]  // [V06 loc3]
+      mov      x25, x0
       bge      G_M1573_IG30 |
|
@jakobbotsch, what's the perf impact to AMD machines? uops.info reports that on Intel, xchg between registers is comparatively expensive. However, for AMD it reports that it is handled cheaply. The concern for the AMD side would be that not using xchg increases instruction count and code size for little benefit there. My expectation is that while this will likely have a positive measurable impact on Intel, it may have an inversely negative measurable impact on AMD. |
For the benchmark I am looking at, the impact on my AMD CPU is small. Of course I cannot say anything absolute about it. Here are the Intel results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22623.1180)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23068.3
[Host] : .NET 8.0.0 (8.0.23.6702), X64 RyuJIT AVX2
Job-HIDAOL : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-VNSHQC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32 PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Here are the AMD (5950X) results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19044.2486/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-alpha.1.23063.5
[Host] : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
Job-IIILXS : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-HOELUT : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32 PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
There is no increased register pressure: at resolution time all registers have already been assigned and we know whether we have free registers. There may be new prolog/epilog saves, though.
I think on average this has better perf characteristics due to what you said, with no or very small negative impact on AMD machines and a large positive impact on Intel CPUs. Anyway, we have benchmarks that are impacted by this, so I think we can get better data from there when/if we merge this, and reevaluate. I'm not against reverting this change if we don't find the tradeoff to be good. |
|
Thanks for checking and sharing results. I agree that this looks reasonable, and as long as it's not increasing register pressure and therefore leading to significantly worse codegen (such as due to additional spilling required, etc.), I think it's fine. It might be nice if LSRA could understand these relative costs. I also agree that it would be nice to be able to do per-architecture micro-optimizations, but it's a broader issue and something that would need to be done separately (and with consideration to the test matrix complexity increase). |
|
Can this use an XOR swap when there is no free register? |
|
Actually the same article mentions that 3 xors will not be faster than one xchg. |
Wouldn't that be problematic for gcrefs? |
yes, it would be a problem. |
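The trade-off being discussed can be sketched as follows (a hypothetical helper, just to illustrate the trick): three xors swap two values without a temporary register, but the intermediate states are exactly why it breaks for GC refs.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the XOR-swap trick discussed above (hypothetical helper, not
// JIT code). Three xors swap two values without a temporary register, but
// as noted it is not faster than xchg, and the intermediate states are the
// GC problem: after the first xor, 'a' no longer holds a valid pointer.
void xorSwap(uint64_t& a, uint64_t& b)
{
    a ^= b; // a = a ^ b: not a valid GC pointer if a and b were refs
    b ^= a; // b = original a
    a ^= b; // a = original b
}
```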
|
cc @dotnet/jit-contrib PTAL @kunalspathak

This SPMI replay failure looks like the timeouts we have been dealing with recently. |
src/coreclr/jit/lsra.cpp
Outdated
regNumber LinearScan::getTempRegForResolution(BasicBlock*      fromBlock,
                                              BasicBlock*      toBlock,
                                              var_types        type,
                                              VARSET_VALARG_TP sharedCriticalLiveSet)
Please add a comment about sharedCriticalLiveSet in the method summary.
}
else
{
    if ((freeRegs & RBM_CALLEE_TRASH) != 0)
Can you add a comment saying we prefer callee-trash registers among the available free temp registers, to avoid prolog/epilog save/restore?
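The preference being requested here might be sketched like this (hypothetical masks and helper for illustration, not the actual LSRA code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not the actual LSRA code: among the free temp
// registers, prefer one that is callee-trash (caller-saved), so picking it
// does not force an extra save/restore in the prolog/epilog.
// RBM_CALLEE_TRASH here is an assumed mask value for illustration.
using regMaskTP = uint64_t;

const regMaskTP RBM_CALLEE_TRASH = 0x00FF;

regMaskTP pickTempReg(regMaskTP freeRegs)
{
    if ((freeRegs & RBM_CALLEE_TRASH) != 0)
    {
        freeRegs &= RBM_CALLEE_TRASH; // narrow to callee-trash candidates
    }
    return freeRegs & (~freeRegs + 1); // pick the lowest set bit
}
```

A callee-saved register is still used as a last resort, matching the "prefer" rather than "require" wording above.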
}
#ifdef TARGET_XARCH
-else
+else if (tempRegInt == REG_NA)
I didn't follow this change. What is this about?
Previously we would always use xchg/swap. Now we only do that if we weren't able to get a temporary register. |
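The control flow being described can be sketched as follows (a hypothetical helper, not the actual JIT code): on xarch the temp-register mov sequence is now tried first, with xchg kept only as the fallback.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the fallback described above, not the actual JIT
// code: getTempRegForResolution returning REG_NA means no free temp
// register, in which case xarch keeps the old xchg-based resolution.
const int REG_NA = -1;

std::string resolutionStrategy(int tempRegInt)
{
    if (tempRegInt == REG_NA)
    {
        return "xchg"; // no free temp register: swap-based resolution
    }
    return "mov";      // temp register available: emit movs through it
}
```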
|
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
Looks like I have some Fuzzlyn failures to look at. |
|
The problem seems to be that:

        bl       Program:M58(S0,byte,I0):bool
-       str      w19, [fp, #0xD1FFAB1E]  // [V13 loc10]
+       mov      w0, w19
        mov      x19, x21
-       ldr      w21, [fp, #0xD1FFAB1E]  // [V13 loc10]
+       mov      w21, w0
        cbz      w0, G_M62392_IG163
        stp      xzr, xzr, [fp, #0xD1FFAB1E]
        stp      xzr, xzr, [fp, #0xD1FFAB1E]
@@ -3395,7 +3395,7 @@ G_M62392_IG162:
        mov      x0, x20
        mov      x2, xzr
        blr      x3
-       ;; size=380 bbWeight=1 PerfScore 134.00
+       ;; size=380 bbWeight=1 PerfScore 132.00
G_M62392_IG163:
        ldr      x0, [fp, #0x10]  // [V528 cse0]
        ldr      x20, [x0, #0x10]
@@ -3725,7 +3725,7 @@ G_M62392_IG171:
        ret      lr
        ;; size=16 bbWeight=1 PerfScore 10.00

Note that we pick w0 as the temporary register even though the cbz terminator consumes it, so the resolution moves clobber the value being tested.

Not sure why this would only be a problem after this PR. I'll dig a little deeper and push a fix. |
|
/azp run runtime-coreclr libraries-jitstress, Fuzzlyn |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
/azp run runtime-coreclr libraries-jitstress, Fuzzlyn |
|
Azure Pipelines successfully started running 2 pipeline(s). |
src/coreclr/jit/lsra.cpp
Outdated
assert(reg != REG_NA);
if (reg != REG_STK)
{
    freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);

Suggested change:
-freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);
+freeRegs &= ~genRegMask(reg ARM_ARG(getIntervalForLocalVar(varIndex)->registerType));
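The suggestion relies on a conditional-argument macro. The definition and signature below are simplified assumptions for illustration, not the real coreclr headers: the idea is that the extra argument is passed only on 32-bit ARM targets and compiles away elsewhere.

```cpp
#include <cassert>

// Assumed, simplified definition of ARM_ARG for illustration: on 32-bit
// ARM it expands to an extra argument, otherwise to nothing.
#ifdef TARGET_ARM
#define ARM_ARG(x) , x
#else
#define ARM_ARG(x)
#endif

// Hypothetical stand-in for genRegMask: takes a register type only on ARM.
unsigned genRegMaskSketch(unsigned reg ARM_ARG(int type))
{
    return 1u << reg;
}
```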
src/coreclr/jit/lsra.cpp
Outdated
// resolveType            - the type of resolution to be performed
// liveSet                - the set of tracked lclVar indices which may require resolution
// terminatorConsumedRegs - the registers consumed by the terminator node.
//                          These registers will be used after any resolution added to 'fromBlock'.

Suggested change:
-// These registers will be used after any resolution added to 'fromBlock'.
+// These registers will be used after any resolution added at the end of the 'fromBlock'.
Probably you meant this. But I'm not sure I fully understand it. Inside getTempRegForResolution we say that they are not free to be used as temp registers, but why is that the case? They are consumed already and should be fine to use, right?
Yes, I meant that. Will update.
They are consumed already and should be fine to use them, right?
No, resolution nodes are added before the terminating node and any registers it uses. The consumed registers are computed here: runtime/src/coreclr/jit/lsra.cpp, lines 7655 to 7726 in f1bdd5a.
resolution nodes are added before the terminating node and any registers it uses.

Of course, that's right. I was thinking of something else. But then how did it work so far? Wasn't this always a problem? |
I'm not totally sure how it worked previously either. If I had to guess, the problematic cases (i.e. switch/JCMP blocks) always end up doing resolution in the target blocks (except for the shared critical edges case, which was skipped before). I think that would make sense since these kinds of blocks should always have 2 or more successors. |
That could be the case. So were you hitting an assert or something when you added support for shared critical edges, and that made you consider the terminatorConsumedRegs? |
No, it was a Fuzzlyn test case, here was the codegen for that case: #81216 (comment)
No, it was a Fuzzlyn test case, here was the codegen for that case: #81216 (comment)
Ah, I missed that comment and findings. That makes sense then.
Can you add a test case for that as well?
The test case is massive, I think weekly Antigen/Fuzzlyn runs should be ok to cover it.
|
/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
The libraries-jitstressregs failures all look like #81366. Will rerun the failing installer job. (Edit: It succeeded on a rerun) |
Size-wise regressions are expected.