This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Use a smaller expansion of GT_INDEX in MinOpts.#13245

Merged
pgavlin merged 2 commits into dotnet:master from pgavlin:NoExpandIndex
Aug 25, 2017

Conversation

@pgavlin

@pgavlin pgavlin commented Aug 6, 2017

We currently expand `GT_INDEX` nodes during morph into an explicit
bounds check followed by a load. For example, this tree:

```
[000059] ------------       /--*  LCL_VAR   int    V09 loc6
[000060] R--XG-------    /--*  INDEX     ref
[000058] ------------    |  \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG-------    *  ASG       ref
[000061] D------N----    \--*  LCL_VAR   ref    V10 loc7
```

is expanded into this tree:

```
[000060] R--XG+------       /--*  IND       ref
[000491] -----+------       |  |  /--*  CNS_INT   long   16 Fseq[#FirstElem]
[000492] -----+------       |  \--*  ADD       byref
[000488] -----+-N----       |     |     /--*  CNS_INT   long   3
[000489] -----+------       |     |  /--*  LSH       long
[000487] -----+------       |     |  |  \--*  CAST      long <- int
[000484] i----+------       |     |  |     \--*  LCL_VAR   int    V09 loc6
[000490] -----+------       |     \--*  ADD       byref
[000483] -----+------       |        \--*  LCL_VAR   ref    V00 arg0
[000493] ---XG+------    /--*  COMMA     ref
[000486] ---X-+------    |  \--*  ARR_BOUNDS_CHECK_Rng void
[000059] -----+------    |     +--*  LCL_VAR   int    V09 loc6
[000485] ---X-+------    |     \--*  ARR_LENGTH int
[000058] -----+------    |        \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7
```

Even in this simple case where both the array object and the index are
lclVars, this represents a rather large increase in the size of the IR.
In the worst case, the JIT introduces an additional lclVar for both the
array object and the index, adding several additional nodes to the tree.
When optimizing, exposing the structure of the array access may be
helpful, as it may allow the compiler to better analyze the program.
When we are not optimizing, however, the expansion serves little purpose
besides constraining the IR shapes that must be handled by the backend.
Due to its need for lclVars in the worst case, this expansion may even
bloat the size of the generated code, as all lclVar references are
generated as loads/stores from/to the stack when we are not optimizing.
In the case above, the expanded tree generates the following x64
assembly:

```
IN0018: 000092 mov      rdi, gword ptr [V00 rbp-10H]
IN0019: 000096 mov      edi, dword ptr [rdi+8]
IN001a: 000099 cmp      dword ptr [V09 rbp-48H], edi
IN001b: 00009C jae      G_M5106_IG38
IN001c: 0000A2 mov      rdi, gword ptr [V00 rbp-10H]
IN001d: 0000A6 mov      esi, dword ptr [V09 rbp-48H]
IN001e: 0000A9 movsxd   rsi, esi
IN001f: 0000AC mov      rdi, gword ptr [rdi+8*rsi+16]
IN0020: 0000B1 mov      gword ptr [V10 rbp-50H], rdi
```
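For readers less familiar with the IR dumps, the expanded tree and the assembly above compute the following; this is a toy C++ model (not JIT code), with the object-header offsets taken from the dump (array length at +8, first element at +16, element size 8 == 1 << 3):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Toy model of the morph-time expansion for an x64 SZ-array of object
// references: bounds check (COMMA/ARR_BOUNDS_CHECK_Rng), scaled address
// computation (CAST/LSH/ADD), then the load (IND).
static void* LoadElement(const uint8_t* arr, int32_t index)
{
    int32_t length;
    std::memcpy(&length, arr + 8, sizeof(length)); // ARR_LENGTH lives at +8
    // The unsigned compare also rejects negative indices (jae G_M5106_IG38).
    if (static_cast<uint32_t>(index) >= static_cast<uint32_t>(length))
        throw std::out_of_range("IndexOutOfRangeException");
    // CAST int -> long, LSH by 3, ADD base, ADD 16 (Fseq[#FirstElem]), IND.
    void* element;
    std::memcpy(&element, arr + (static_cast<int64_t>(index) << 3) + 16,
                sizeof(element));
    return element;
}
```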

Inspired by other recent experiments (e.g. #13188), this change
introduces a new node that replaces the above expansion in MinOpts. This
node, `GT_INDEX_ADDR`, represents the bounds check and address
computation involved in an array access, and returns the address of the
element that is to be loaded or stored. Using this node, the example
tree given above expands to the following:

```
[000489] a--XG+------    /--*  IND       ref
[000059] -----+------    |  |  /--*  LCL_VAR   int    V09 loc6
[000060] R--XG+--R---    |  \--*  INDEX_ADDR byref
[000058] -----+------    |     \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7
```

This expansion requires only the addition of the `GT_IND` node that
represents the memory access itself. The reduction in IR size translates
to about a 2% decrease in instructions retired during non-optimizing
compilation. Furthermore, this expansion tends to generate smaller
code; for example, the tree given above is generated in 29 rather than
35 bytes:

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```
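The new shape splits the access into an address computation and a separate dereference. As a toy C++ model (not JIT code, same assumed layout as before: length at +8, first element at +16):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// INDEX_ADDR: bounds check plus element address (the cmp/jae/lea above).
static const uint8_t* IndexAddr(const uint8_t* arr, int32_t index)
{
    int32_t length;
    std::memcpy(&length, arr + 8, sizeof(length));
    if (static_cast<uint32_t>(index) >= static_cast<uint32_t>(length))
        throw std::out_of_range("IndexOutOfRangeException");
    return arr + (static_cast<int64_t>(index) << 3) + 16; // lea [rsi+8*rdi+16]
}

// GT_IND: the one extra node, just the final load through the address.
static void* Ind(const uint8_t* addr)
{
    void* value;
    std::memcpy(&value, addr, sizeof(value));
    return value;
}
```

The operands stay in registers between the two steps, which is why no spill temps are needed.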

@pgavlin
Author

pgavlin commented Aug 6, 2017

Here's the throughput data for crossgen System.Private.CoreLib.dll on Windows/x64 as gathered using pin:

| Jit  | PGO | Minopts | Inst Retired | Ratio to resp base |
|------|-----|---------|--------------|--------------------|
| Base | No  | No      | 1.992e10     | --                 |
| Base | No  | Yes     | 1.198e10     | --                 |
| Base | Yes | No      | 1.706e10     | --                 |
| Base | Yes | Yes     | 1.069e10     | --                 |
| Diff | No  | No      | 1.990e10    | 0.999              |
| Diff | No  | Yes     | 1.172e10    | 0.978              |
| Diff | Yes | No      | 1.710e10    | 1.002              |
| Diff | Yes | Yes     | 1.051e10    | 0.983              |

@pgavlin
Author

pgavlin commented Aug 6, 2017

cc @AndyAyersMS @CarolEidt

@pgavlin
Author

pgavlin commented Aug 6, 2017

@dotnet-bot test Windows_NT x64 Throughput

@AndyAyersMS
Member

Very cool.

I think you should do this in debug and rare block cases too, and (soon) in Tier0.

Do you need to carry over the spill logic from the original expansion?

Do you have native size data? Would be interesting to see if the ~1.7% TP win here implies a 2.8% reduction in code size.

It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
;; IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi+8*rdi+16]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```

@pgavlin
Author

pgavlin commented Aug 6, 2017

> I think you should do this in debug and rare block cases too, and (soon) in Tier0.

I thought about this, but opted not to do so in this set of changes in order to avoid teaching the optimizer about the new node. I agree that it is something we ought to look into (it may be easier than I think).

> Do you need to carry over the spill logic from the original expansion?

Nope. The spill logic is only necessary because the original expansion needs to make both the array object and the index multi-use. If they are "cheap" and side-effect free, the original expansion simply clones them; if they are expensive or side-effecting, it dumps them into lclVars and uses the lclVars instead. The new expansion does not need to make these values multi-use; it just needs them to be computed into registers (which will then be multiply-used).
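The clone-versus-spill decision described above can be sketched as follows; this is a toy model with hypothetical types, not the JIT's actual `GenTree` code:

```cpp
#include <cassert>
#include <string>

// Toy model of the multi-use handling the old expansion requires: a cheap,
// side-effect-free operand is simply cloned, while anything else is spilled
// to a fresh temp ("lclVar") and both uses read the temp instead.
struct Operand
{
    std::string label;
    bool cheap;
    bool sideEffects;
};

struct MultiUse
{
    Operand use1;
    Operand use2;
    bool spilled; // did we have to introduce a temp lclVar?
};

static MultiUse MakeMultiUse(const Operand& value, int& nextTempNum)
{
    if (value.cheap && !value.sideEffects)
        return { value, value, false }; // clone the tree: no new lclVar

    // Expensive or side-effecting: ASG tmp = value, then use tmp twice.
    Operand tmp{ "V" + std::to_string(nextTempNum++), true, false };
    return { tmp, tmp, true };
}
```

Under the new expansion, neither path is needed; each operand is evaluated once and its register is reused.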

> Do you have native size data? Would be interesting to see if the ~1.7% TP win here implies a 2.8% reduction in code size.

I do. Here's what jit-diff had to say:

```
x86: Total bytes of diff: -12 (0.00 % of base)
x86 minopts: Total bytes of diff: -77397 (-0.58 % of base)
x64: Total bytes of diff: -24 (0.00 % of base)
x64 minopts: Total bytes of diff: -124736 (-0.74 % of base)
```

Some assemblies do better than others. System.Private.CoreLib.dll, for example, improves by over 1% on both x86 minopts and x64 minopts. The diffs in non-minopts are due to two functions in S.P.CoreLib that end up being compiled without optimization (Task.WaitAll and Task.WaitAny). We might want to investigate these at some point.

> It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.

It would be nice, yes. Getting this would require additional work in lowering in order to determine whether or not the GT_INDEX_ADDR node could be contained by its consumer. Unfortunately, that check is not currently terribly cheap due to some of the side-effect analysis that it performs, and would probably negatively affect throughput.

@AndyAyersMS
Member

Looks like the WaitAll/WaitAny debug codegen is intentional:

```
        [MethodImpl(MethodImplOptions.NoOptimization)]  // this is needed for the parallel debugger
        public static void WaitAll(params Task[] tasks)
```

@stephentoub is this attribute here to ensure this method's frame remains distinct (eg noinline and no tail call?) If so I think marking it as NoInline has the same effect and would give slightly better code.

@AndyAyersMS
Member

> It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.
>
> It would be nice, yes. Getting this would require additional work in lowering in order to determine whether or not the GT_INDEX_ADDR node could be contained by its consumer. Unfortunately, that check is not currently terribly cheap due to some of the side-effect analysis that it performs, and would probably negatively affect throughput.

Don't we end up doing containment analysis anyways, for the lea?

@pgavlin
Author

pgavlin commented Aug 7, 2017

> Don't we end up doing containment analysis anyways, for the lea?

The implementation presented here leaves the `GT_INDEX_ADDR` node intact all the way to the code generator, so lowering never sees an LEA.

@AndyAyersMS
Member

I can understand punting on the rare block case now, but it really seems like you should enable this for debug codegen. While we certainly care about making minopts fast, we care even more about making debug codegen fast.

I looked over the code changes and they look reasonable, but you should get someone else's take on it as well.

@pgavlin
Author

pgavlin commented Aug 7, 2017

> I can understand punting on the rare block case now, but it really seems like you should enable this for debug codegen. While we certainly care about making minopts fast, we care even more about making debug codegen fast.

I looked into this a bit yesterday, and it doesn't seem as tough as I thought it might be. If anything, it helps illuminate a number of routines that really ought to handle the new node but that these changes do not update.

> I looked over the code changes and they look reasonable, but you should get someone else's take on it as well.

Yup. This is just a side project at the moment, so I'm not in any particular rush. I'm hoping that @CarolEidt can take a peek tomorrow once she's back.

@stephentoub
Member

> @stephentoub is this attribute here to ensure this method's frame remains distinct (eg noinline and no tail call?) If so I think marking it as NoInline has the same effect and would give slightly better code.

I believe that's the case, but @gregg-miskelly could comment for sure. There was a recent discussion related to this in some issue, but I can't seem to find it now.

@gregg-miskelly

That code wasn't done by the core debugger team (I think it was done by the PCP debugger team way back in the day), so I can't say with 100% certainty, but the debugger code wants to scrape the arguments (both the this value and the value of tasks being waited on), so I think NoInline wouldn't be enough. Though I think things could be improved by moving the complicated function bodies to a separate method which can be optimized.

@CarolEidt CarolEidt left a comment

LGTM with a few suggestions

```
default:
{
    // Multiply the index by the element size.
    //
```


Perhaps you could open an issue for this.

```
#endif // FEATURE_SIMD

#ifndef LEGACY_BACKEND
// In minopts, we expand GT_INDEX to GT_IND(GT_INDEX_ADDR) in order to save on IR.
```


This could use a somewhat expanded comment about the pros and cons.

src/jit/gtlist.h Outdated
```
GTNODE(COLON , GenTreeColon ,0,GTK_BINOP|GTK_NOTLIR)

GTNODE(INDEX , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP|GTK_NOTLIR) // SZ-array-element
GTNODE(INDEX_ADDR , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP) // addr of SZ-array-element
```


It might be useful to say a bit more here - perhaps "addr of SZ-array-element; used when minimizing JIT compilation time."

```
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
```


Given that there are several cases to consider, and that much of this code is comment, do you think it would be useful to put this in codegencommon.cpp, and perhaps even abstract the target-specific bits to avoid #ifdefs?

@pgavlin pgavlin force-pushed the NoExpandIndex branch 2 times, most recently from 0cefca5 to a3c9517 on August 15, 2017 16:40
@pgavlin
Author

pgavlin commented Aug 15, 2017

@dotnet-bot test Windows_NT x64 minopts

@pgavlin
Author

pgavlin commented Aug 15, 2017

@dotnet-bot test Windows_NT x64 Throughput

@pgavlin
Author

pgavlin commented Aug 21, 2017

@dotnet-bot
test Windows_NT x64 minopts
test Windows_NT x64 Throughput

@pgavlin
Author

pgavlin commented Aug 21, 2017

@AndyAyersMS @CarolEidt @dotnet/jit-contrib PTAL

I've updated this a bit to make it easier to run the smaller expansion through the optimizer in the future. The primary difference is that rather than using `GenTreeIndex` for both `GT_INDEX` and `GT_INDEX_ADDR`, these changes now use a new struct, `GenTreeIndexAddr`, for the latter.

@pgavlin
Author

pgavlin commented Aug 22, 2017

@dotnet-bot test Windows_NT x86 CoreCLR Perf Tests Correctness

```
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
```
Member


Still think this should be `if (opts.MinOpts() || opts.compDbgCode)`.

Author

I agree. I intend to do this with a follow-up change, as it requires a number of changes to various bits of the optimizer.


I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)

Author

Okay, I think I've parsed this--you want to see a comment about why we use a larger expansion when optimizing?


Yes, that's right

Author

Done.

Member

@AndyAyersMS AndyAyersMS left a comment

Just the one nit about having this kick in for debuggable codegen too.

Otherwise looks good.

@CarolEidt CarolEidt left a comment

Looks good overall, with some things I would change.

```
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
```


I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)

```
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
```


I still feel that there is a lot of common code between this and the xarch implementation that could be factored out and hoisted to codegencommon.cpp to avoid duplication.

Author

I disagree: while the flow of the logic is similar, it differs substantially in whether/how the index is widened and what sequence of instructions is used to generate the address calculation. Refactoring this apart would produce something like

```
{
    GenTree* const base  = node->Arr();
    GenTree* const index = node->Index();

    genConsumeReg(base);
    genConsumeReg(index);

    // Generate the bounds check if necessary.
    if ((node->gtFlags & GTF_INX_RNGCHK) != 0)
    {
        generateRangeCheckForIndexAddr();
    }

    generateAddressForIndexAddr();

    genProduceReg(node);
}
```

which I don't think is a substantial improvement.


OK, makes sense.

@CarolEidt CarolEidt left a comment

LGTM. Thanks for expanding the comment.

@pgavlin pgavlin merged commit a7ffdec into dotnet:master Aug 25, 2017
@pgavlin pgavlin deleted the NoExpandIndex branch August 25, 2017 16:18
@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017
briansull added a commit to briansull/coreclr that referenced this pull request Aug 29, 2017
briansull added a commit to briansull/coreclr that referenced this pull request Aug 29, 2017
jkotas added a commit to jkotas/coreclr that referenced this pull request Aug 30, 2017
This reverts commit a7ffdec, reversing
changes made to f5f622d.
jkotas added a commit that referenced this pull request Aug 30, 2017


7 participants