Do not spill eeStack after ldtoken opcode. #10215
Conversation
PTAL @dotnet/jit-contrib
Lab tests didn't show any difference in throughput.
Even linear recursive algorithms would suffer a stack overflow if the recursion got too deep.
I'd guess that's because those lab tests are measuring common-case inputs, and this code is an escape valve for "pathological" worst-case inputs. You could try running this change through SPMI to see if any tests suddenly take forever or fail with stack overflow. You could also hand-craft inputs with arbitrarily high stack depths, but without knowing which algorithms might choke there'd be some art/luck to crafting an input that would trigger problematic recursion.
It needs to be low enough that we won't fail with stack overflow, and that even for "pathological" inputs we don't spend hours trying to jit them. It needs to be high enough that it won't hamper optimizations in, say, 99.999999% of code. So I'd suggest coming at it from the other side -- in addition to seeing how high you can make it without our tests falling over, see how low you can make it without our tests having asm diffs. If we have those upper/lower bounds then we can argue about where in that range we'd guess would most often be in the good range for inputs outside our test suite.
and/or add some instrumentation to measure how big the trees get, run that on top of your change to remove the limit, and build up some histograms of how frequent each tree size is on various inputs.
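A hedged sketch of the instrumentation suggested above; all names here are illustrative, not real JIT APIs. The idea is to record how many IL bytes pass between empty-evaluation-stack points, bucketed by powers of two, so a histogram shows how often each span length occurs on various inputs.

```cpp
#include <cstddef>
#include <map>

// Round a span length (IL bytes since the last empty evaluation stack)
// up to its power-of-two bucket, keeping the histogram compact.
size_t HistogramBucket(size_t ilBytes)
{
    size_t bucket = 1;
    while (bucket < ilBytes)
    {
        bucket <<= 1;
    }
    return bucket;
}

class SpillSpanHistogram
{
public:
    // Record one observed span length.
    void Record(size_t ilBytes)
    {
        m_counts[HistogramBucket(ilBytes)]++;
    }

    // How many spans landed in a given bucket.
    size_t CountInBucket(size_t bucket) const
    {
        auto it = m_counts.find(bucket);
        return (it == m_counts.end()) ? 0 : it->second;
    }

private:
    std::map<size_t, size_t> m_counts;
};
```

Running this over SPMI collections (or the test suite) with the limit removed would show where the bulk of real spans sit relative to any proposed threshold.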
Some context might help. If this is related to #9948 then we're not talking about having better or worse CQ but about failing to recognize certain patterns that the JIT is required to recognize in certain cases.
Yes, it is related to that issue. However, the JIT works better with trees in all phases, and this inconspicuous transformation might cause worse register allocation, at least.
Yes, that's true, but optimizations aren't a must-have; you can make a reasonable compromise between tree depth and CQ. But in the case of your issue it's not clear if a compromise is possible and how it may look. Side note: there's a somewhat related issue - #8119.
Sounds like a worthwhile experiment would be to do this where the hand-crafted input is using the InitializeArray pattern from #9948 -- if such an input blows our stack, then we'll either need a more robust implementation of the transform that CoreRT requires or we'll need to live with a (new, lower) max array rank/size/whatever in CoreRT.
@JosephTremoulet The pattern in question is this: https://github.com/dotnet/corefx/blob/f38d75c8a10309b34213e5d4a96313f538194392/src/Microsoft.CSharp/src/Microsoft/CSharp/RuntimeBinder/Semantics/PredefinedMembers.cs#L765

It's not about using big arrays, or arrays with too many dimensions. This is what the relevant part of the IL looks like:

The above is the 109-element array allocation and the store of its first element. The lines after

Is the problem that the stack is not empty and RyuJIT can't spill it in a clean way, so it will insert an artificial spill?
What's going on with
Ah, sorry. I didn't realize the first element of the array is so short that the C# compiler initialized it with code - here's the full NGenDump (that is, until we hit the noway assert saying that a

The first element of the array that is initialized using the helper is at
Thanks, I think I'm following now. So we're not trying to pattern-match arbitrarily large IL sequences; we're trying to pattern-match sequences ranging in size from five to sixty-eight IL opcodes. And in the problem case, we just happen to get unlucky in that the "have I read in a lot of IL without hitting an empty evaluation stack?" check hits its limit (of 200 bytes of IL) right in the middle of that 5-instruction sequence (due in this case to an outer array having a long sequence without an empty evaluation stack).

It's a little odd to me that we're trying to pattern-match an IL sequence, but we're doing it after the sequence has already been translated to RyuJit IR. It's nice for throughput that this only kicks in when there's actually a call to

On the other hand, the code that does the spilling passes the default value of false for
So it seems to me (but someone should verify) that the problem only happens if the "it's been too long since we've had an empty evaluation stack" check kicks in right between the
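A minimal sketch of the guard being discussed, assuming a simplified OPCODE enum (the real code lives in src/jit/importer.cpp and uses the JIT's own OPCODE type). The point is that the "too many IL bytes without an empty stack" spill must not fire between the ldtoken and the call, because impInitializeArrayIntrinsic pattern-matches the sequence that the C# compiler emits for array initializers:

```cpp
// IL sequence we must avoid splitting (see #9948):
//     dup
//     ldtoken  <field handle for the initialization data>
//     call     System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray

enum OPCODE // illustrative subset of the real enum
{
    CEE_NOP,
    CEE_DUP,
    CEE_LDTOKEN,
    CEE_CALL,
};

// Returns true if it is legal to spill the importer stack now, false if the
// previous opcode starts a sequence we do not want to divide.
bool impCanSpillNow(OPCODE prevOpcode)
{
    // Don't spill right after ldtoken: the token tree must stay on the
    // stack so the InitializeArray pattern can still be recognized.
    return prevOpcode != CEE_LDTOKEN;
}
```

With this guard, the periodic spill simply waits one more opcode, after which the call has consumed the token and spilling is safe again.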
How about something low tech like passing a flag to the jit indicating "this method contains array initializers" that would then be used to suppress the "too many IL opcodes since last spill" logic entirely for certain methods. Or maybe there is some other way to limit the set of methods where they might show up, e.g. if they only appear in a
It doesn't look right to create another workaround in addition to the existing one.
Force-pushed 2fbfc6f to 8391d4d
Force-pushed b6d776e to 5cacd40
I updated the PR according to @JosephTremoulet's suggestion. As I understand it, it is not possible to cause a stack overflow after this change. What's more, we could forbid spilling after all leaf opcodes.
Created PR #10835 with a deep tree to try to make the initial PR fail.
JosephTremoulet left a comment
Looks good with some comments.
If you wanted to mitigate very long sequences of ldtokens, you could add a check that the opcode before ldtoken is dup (which would require tracking prevPrevOpcode, or maybe lastDupOffset, similar to delegateCreateStart), but I believe you're right that it's not necessary.
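A hedged sketch of this optional hardening; the struct and names here are illustrative, not the actual importer state. Remembering two previous opcodes lets the guard suppress spilling only in the middle of a dup; ldtoken pair, so a pathological run of back-to-back ldtokens can still be spilled:

```cpp
#include <initializer_list>

enum OPCODE // illustrative subset
{
    CEE_NOP,
    CEE_DUP,
    CEE_LDTOKEN,
};

struct ImportState
{
    OPCODE prevOpcode     = CEE_NOP;
    OPCODE prevPrevOpcode = CEE_NOP;

    // Shift the opcode history as the importer consumes each opcode.
    void Advance(OPCODE opcode)
    {
        prevPrevOpcode = prevOpcode;
        prevOpcode     = opcode;
    }

    // Spill is suppressed only between the dup and the call, i.e. when the
    // last two opcodes were exactly "dup; ldtoken".
    bool CanSpillNow() const
    {
        return !(prevOpcode == CEE_LDTOKEN && prevPrevOpcode == CEE_DUP);
    }
};

// Illustrative helper: feed a sequence and ask whether a spill would be
// allowed at the end of it.
bool CanSpillAfter(std::initializer_list<OPCODE> opcodes)
{
    ImportState state;
    for (OPCODE op : opcodes)
    {
        state.Advance(op);
    }
    return state.CanSpillNow();
}
```

As the comment says, this is likely unnecessary for real inputs; the simpler one-opcode check already bounds the delay to a single opcode.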
//
// Return Value:
//    true if it is legal, false if it could be a sequence that we do not want to divide.
bool Compiler::impCanSpillNow(OPCODE prevOpcode)
Seems odd to shadow the prevOpcode member with a prevOpcode parameter. Did you want to make this static? I'd lean toward making it parameterless.
prevOpcode is not a member of Compiler; do we want to make it one?
No, sorry, I was just misreading (I thought you were loading a member at the callsite, but I see now you're passing a local), it's good how you have it.
src/jit/importer.cpp
Outdated
bool Compiler::impCanSpillNow(OPCODE prevOpcode)
{
    StackEntry& lastSE = impStackTop(0);
    GenTreePtr  tree   = lastSE.val;
Looks like tree is unused (and then so is lastSE).
Yes, I thought that it would be possible to recover the last opcode from the tree, but I was wrong.
src/jit/importer.cpp
Outdated
    GenTreePtr tree = lastSE.val;
    if (prevOpcode == CEE_LDTOKEN)
    {
        return false;
This should have a comment referencing/explaining the InitializeArray sequence that we're avoiding breaking up to guarantee that impInitializeArrayIntrinsic can succeed.
src/jit/importer.cpp
Outdated
    {
        return false;
    }
    return true;
Nit: this could be just return prevOpcode != CEE_LDTOKEN;
src/jit/compiler.h
Outdated
//---------------- Spilling the importer stack ----------------------------

// The maximum tree size in the importer before it spills all trees from the stack into local variables.
"maximum tree size..." is a bit misleading, as this is actually "maximum number of bytes of IL processed..."
I fixed the comments, please take another look.
Picks up dotnet/coreclr#10215.
There is old code that spills the trees on the stack every 200 IL opcodes.
I suppose that it was done because there were non-linear recursive algorithms. That happened many years ago, before TFS source control, and I was not able to find who did it.
Now we should not have them, and there is no difference between
10 trees of 100 nodes each and one tree with 1000 nodes.
This spilling might create performance regressions because some
optimizations work only with tree values.
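The legacy heuristic described above can be sketched as follows; the constant and names are illustrative (the real check lives in the importer and, after this PR, is additionally gated on the previous opcode):

```cpp
// IL bytes the importer may consume without an empty evaluation stack
// before it spills everything into locals (illustrative constant).
const unsigned MAX_TREE_SIZE = 200;

// Decide whether to spill: only when there is something on the stack and
// we have gone "too long" without the stack naturally emptying.
bool ShouldSpillStack(unsigned ilBytesSinceLastEmptyStack, unsigned stackDepth)
{
    return (stackDepth > 0) && (ilBytesSinceLastEmptyStack > MAX_TREE_SIZE);
}
```

Each spill turns a stack entry into a local-variable store and reload, which is exactly the transformation that can hide tree values from tree-based optimizations and worsen register allocation.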