Do not spill eeStack after ldtoken opcode. #10215
Conversation
PTAL @dotnet/jit-contrib
Lab tests didn't show any difference in throughput.
Even linear recursive algorithms would suffer a stack overflow if the recursion got too deep.
I'd guess that's because those lab tests are measuring common-case inputs, and this code is an escape valve for "pathological" worst-case inputs. You could try running this change through SPMI to see if any tests suddenly take forever or fail with stack overflow. You could also hand-craft inputs with arbitrarily high stack depths, but without knowing which algorithms might choke there'd be some art/luck to crafting an input that would trigger problematic recursion.
It needs to be low enough that we won't fail with stack overflow, and that even for "pathological" inputs we don't spend hours trying to jit them. It needs to be high enough that it won't hamper optimizations in, say, 99.999999% of code. So I'd suggest coming at it from the other side -- in addition to seeing how high you can make it without our tests falling over, see how low you can make it without our tests having asm diffs. If we have those upper/lower bounds then we can argue about where in that range we'd guess would most often be in the good range for inputs outside our test suite.
and/or add some instrumentation to measure how big the trees get, run that on top of your change to remove the limit, and build up some histograms of how frequent each tree size is on various inputs.
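A hedged sketch of the instrumentation suggested above; all names here are illustrative, not real JIT APIs. The idea is to record how many IL bytes pass between empty-evaluation-stack points, bucketed by powers of two, so a histogram shows how often each span length occurs on various inputs.

```cpp
#include <cstddef>
#include <map>

// Round a span length (IL bytes since the last empty evaluation stack)
// up to its power-of-two bucket, keeping the histogram compact.
size_t HistogramBucket(size_t ilBytes)
{
    size_t bucket = 1;
    while (bucket < ilBytes)
    {
        bucket <<= 1;
    }
    return bucket;
}

class SpillSpanHistogram
{
public:
    // Record one observed span length.
    void Record(size_t ilBytes)
    {
        m_counts[HistogramBucket(ilBytes)]++;
    }

    // How many spans landed in a given bucket.
    size_t CountInBucket(size_t bucket) const
    {
        auto it = m_counts.find(bucket);
        return (it == m_counts.end()) ? 0 : it->second;
    }

private:
    std::map<size_t, size_t> m_counts;
};
```

Running this over SPMI collections (or the test suite) with the limit removed would show where the bulk of real spans sit relative to any proposed threshold.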
Some context might help. If this is related to #9948 then we're not talking about having better or worse CQ but about failing to recognize certain patterns that the JIT is required to recognize in certain cases.
Yes, it is related to that issue. However, the JIT works better with trees in all phases, and this inconspicuous transformation might cause worse register allocation, at least.
Yes, that's true, but optimizations aren't a must-have; you can make a reasonable compromise between tree depth and CQ. But in the case of your issue it's not clear if a compromise is possible and how it may look. Side note: there's a somewhat related issue - #8119.
Sounds like a worthwhile experiment would be to do this where the hand-crafted input is using the InitializeArray pattern from #9948 -- if such an input blows our stack, then we'll either need a more robust implementation of the transform that CoreRT requires or we'll need to live with a (new, lower) max array rank/size/whatever in CoreRT.
@JosephTremoulet The pattern in question is this: https://github.com/dotnet/corefx/blob/f38d75c8a10309b34213e5d4a96313f538194392/src/Microsoft.CSharp/src/Microsoft/CSharp/RuntimeBinder/Semantics/PredefinedMembers.cs#L765

It's not about using big arrays, or arrays with too many dimensions. This is what the relevant part of the IL looks like:

The above is the 109-element array allocation and the store of its first element. The lines after

Is the problem that the stack is not empty and RyuJIT can't spill it in a clean way, so it will insert an artificial spill?
What's going on with
Ah, sorry. I didn't realize the first element of the array is so short that the C# compiler initialized it with code - here's the full NGenDump (that is, until we hit the noway assert saying that a

The first element of the array that is initialized using the helper is at
Thanks, I think I'm following now. So we're not trying to pattern-match arbitrarily large IL sequences; we're trying to pattern-match sequences ranging in size from five to sixty-eight IL opcodes. And in the problem case, we just happen to get unlucky in that the "have I read in a lot of IL without hitting an empty evaluation stack?" check hits its limit (of 200 bytes of IL) right in the middle of that 5-instruction sequence (due in this case to an outer array having a long sequence without an empty evaluation stack).

It's a little odd to me that we're trying to pattern-match an IL sequence, but we're doing it after the sequence has already been translated to RyuJit IR. It's nice for throughput that this only kicks in when there's actually a call to

On the other hand, the code that does the spilling passes the default value of false for
So it seems to me (but someone should verify) that the problem only happens if the "it's been too long since we've had an empty evaluation stack" check kicks in right between the
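A minimal sketch of the guard being discussed, assuming a simplified OPCODE enum (the real code lives in src/jit/importer.cpp and uses the JIT's own OPCODE type). The point is that the "too many IL bytes without an empty stack" spill must not fire between the ldtoken and the call, because impInitializeArrayIntrinsic pattern-matches the sequence that the C# compiler emits for array initializers:

```cpp
// IL sequence we must avoid splitting (see #9948):
//     dup
//     ldtoken  <field handle for the initialization data>
//     call     System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray

enum OPCODE // illustrative subset of the real enum
{
    CEE_NOP,
    CEE_DUP,
    CEE_LDTOKEN,
    CEE_CALL,
};

// Returns true if it is legal to spill the importer stack now, false if the
// previous opcode starts a sequence we do not want to divide.
bool impCanSpillNow(OPCODE prevOpcode)
{
    // Don't spill right after ldtoken: the token tree must stay on the
    // stack so the InitializeArray pattern can still be recognized.
    return prevOpcode != CEE_LDTOKEN;
}
```

With this guard, the periodic spill simply waits one more opcode, after which the call has consumed the token and spilling is safe again.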
How about something low tech like passing a flag to the jit indicating "this method contains array initializers" that would then be used to suppress the "too many IL opcodes since last spill" logic entirely for certain methods. Or maybe there is some other way to limit the set of methods where they might show up, e.g. if they only appear in a
It doesn't look right to create another workaround in addition to the existing one.
Force-pushed 2fbfc6f to 8391d4d
Force-pushed b6d776e to 5cacd40
I updated the PR according to @JosephTremoulet's suggestion. As I understand it, it is not possible to cause a stack overflow after this change. What's more, we could forbid spilling after all leaf opcodes.
Created PR #10835 with a deep tree to try to make the initial PR fail.
JosephTremoulet left a comment
Looks good with some comments.
If you wanted to mitigate very long sequences of ldtokens, you could add a check that the opcode before ldtoken is dup (which would require tracking prevPrevOpcode, or maybe lastDupOffset, similar to delegateCreateStart), but I believe you're right that it's not necessary.
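A hedged sketch of this optional hardening; the struct and names here are illustrative, not the actual importer state. Remembering two previous opcodes lets the guard suppress spilling only in the middle of a dup; ldtoken pair, so a pathological run of back-to-back ldtokens can still be spilled:

```cpp
#include <initializer_list>

enum OPCODE // illustrative subset
{
    CEE_NOP,
    CEE_DUP,
    CEE_LDTOKEN,
};

struct ImportState
{
    OPCODE prevOpcode     = CEE_NOP;
    OPCODE prevPrevOpcode = CEE_NOP;

    // Shift the opcode history as the importer consumes each opcode.
    void Advance(OPCODE opcode)
    {
        prevPrevOpcode = prevOpcode;
        prevOpcode     = opcode;
    }

    // Spill is suppressed only between the dup and the call, i.e. when the
    // last two opcodes were exactly "dup; ldtoken".
    bool CanSpillNow() const
    {
        return !(prevOpcode == CEE_LDTOKEN && prevPrevOpcode == CEE_DUP);
    }
};

// Illustrative helper: feed a sequence and ask whether a spill would be
// allowed at the end of it.
bool CanSpillAfter(std::initializer_list<OPCODE> opcodes)
{
    ImportState state;
    for (OPCODE op : opcodes)
    {
        state.Advance(op);
    }
    return state.CanSpillNow();
}
```

As the comment says, this is likely unnecessary for real inputs; the simpler one-opcode check already bounds the delay to a single opcode.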
//
// Return Value:
//    true if it is legal, false if it could be a sequence that we do not want to divide.
bool Compiler::impCanSpillNow(OPCODE prevOpcode)
Seems odd to shadow the prevOpcode member with a prevOpcode parameter. Did you want to make this static? I'd lean toward making it parameterless.
prevOpcode is not a member of Compiler; do we want to make it one?
No, sorry, I was just misreading (I thought you were loading a member at the callsite, but I see now you're passing a local), it's good how you have it.
src/jit/importer.cpp
Outdated
bool Compiler::impCanSpillNow(OPCODE prevOpcode)
{
    StackEntry& lastSE = impStackTop(0);
    GenTreePtr  tree   = lastSE.val;
Looks like tree is unused (and then so is lastSE).
Yes, I thought that it would be possible to recover the last opcode from the tree, but I was wrong.
src/jit/importer.cpp
Outdated
    GenTreePtr tree = lastSE.val;
    if (prevOpcode == CEE_LDTOKEN)
    {
        return false;
This should have a comment referencing/explaining the InitializeArray sequence that we're avoiding breaking up to guarantee that impInitializeArrayIntrinsic can succeed.
src/jit/importer.cpp
Outdated
    {
        return false;
    }
    return true;
Nit: this could be just return prevOpcode != CEE_LDTOKEN;
src/jit/compiler.h
Outdated
//---------------- Spilling the importer stack ----------------------------

// The maximum tree size in the importer before it spills all trees from the stack into local variables.
"maximum tree size..." is a bit misleading, as this is actually "maximum number of bytes of IL processed..."
I fixed the comments, please take another look.
Picks up dotnet/coreclr#10215.
There is old code that spills the trees on the stack every 200 IL opcodes.
I suppose that it was done because there were non-linear recursive algorithms. That happened many years ago, before TFS source control, and I was not able to find who did it.
Now we should not have them, and there is no difference between
10 trees of 100 nodes each and one tree with 1000 nodes.
This spilling might create performance regressions because some
optimizations work only with tree values.
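The legacy heuristic described above can be sketched as follows; the constant and names are illustrative (the real check lives in the importer and, after this PR, is additionally gated on the previous opcode):

```cpp
// IL bytes the importer may consume without an empty evaluation stack
// before it spills everything into locals (illustrative constant).
const unsigned MAX_TREE_SIZE = 200;

// Decide whether to spill: only when there is something on the stack and
// we have gone "too long" without the stack naturally emptying.
bool ShouldSpillStack(unsigned ilBytesSinceLastEmptyStack, unsigned stackDepth)
{
    return (stackDepth > 0) && (ilBytesSinceLastEmptyStack > MAX_TREE_SIZE);
}
```

Each spill turns a stack entry into a local-variable store and reload, which is exactly the transformation that can hide tree values from tree-based optimizations and worsen register allocation.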