This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Use a smaller expansion of GT_INDEX in MinOpts.#13245

Merged
pgavlin merged 2 commits into dotnet:master from pgavlin:NoExpandIndex
Aug 25, 2017

Conversation

@pgavlin

@pgavlin pgavlin commented Aug 6, 2017

We currently expand `GT_INDEX` nodes during morph into an explicit
bounds check followed by a load. For example, this tree:

```
[000059] ------------       /--*  LCL_VAR   int    V09 loc6
[000060] R--XG-------    /--*  INDEX     ref
[000058] ------------    |  \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG-------    *  ASG       ref
[000061] D------N----    \--*  LCL_VAR   ref    V10 loc7
```

is expanded into this tree:

```
[000060] R--XG+------       /--*  IND       ref
[000491] -----+------       |  |  /--*  CNS_INT   long   16 Fseq[#FirstElem]
[000492] -----+------       |  \--*  ADD       byref
[000488] -----+-N----       |     |     /--*  CNS_INT   long   3
[000489] -----+------       |     |  /--*  LSH       long
[000487] -----+------       |     |  |  \--*  CAST      long <- int
[000484] i----+------       |     |  |     \--*  LCL_VAR   int    V09 loc6
[000490] -----+------       |     \--*  ADD       byref
[000483] -----+------       |        \--*  LCL_VAR   ref    V00 arg0
[000493] ---XG+------    /--*  COMMA     ref
[000486] ---X-+------    |  \--*  ARR_BOUNDS_CHECK_Rng void
[000059] -----+------    |     +--*  LCL_VAR   int    V09 loc6
[000485] ---X-+------    |     \--*  ARR_LENGTH int
[000058] -----+------    |        \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7
```

Even in this simple case where both the array object and the index are
lclVars, this represents a rather large increase in the size of the IR.
In the worst case, the JIT introduces an additional lclVar for both the
array object and the index, adding several additional nodes to the tree.
When optimizing, exposing the structure of the array access may be
helpful, as it may allow the compiler to better analyze the program.
When we are not optimizing, however, the expansion serves little purpose
besides constraining the IR shapes that must be handled by the backend.
Due to its need for lclVars in the worst case, this expansion may even
bloat the size of the generated code, as all lclVar references are
generated as loads/stores from/to the stack when we are not optimizing.
In the case above, the expanded tree generates the following x64
assembly:

```
IN0018: 000092 mov      rdi, gword ptr [V00 rbp-10H]
IN0019: 000096 mov      edi, dword ptr [rdi+8]
IN001a: 000099 cmp      dword ptr [V09 rbp-48H], edi
IN001b: 00009C jae      G_M5106_IG38
IN001c: 0000A2 mov      rdi, gword ptr [V00 rbp-10H]
IN001d: 0000A6 mov      esi, dword ptr [V09 rbp-48H]
IN001e: 0000A9 movsxd   rsi, esi
IN001f: 0000AC mov      rdi, gword ptr [rdi+8*rsi+16]
IN0020: 0000B1 mov      gword ptr [V10 rbp-50H], rdi
```
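For readers less familiar with the IR dumps, the expanded tree and the assembly above compute the following; this is a toy C++ model (not JIT code), with the object-header offsets taken from the dump (array length at +8, first element at +16, element size 8 == 1 << 3):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Toy model of the morph-time expansion for an x64 SZ-array of object
// references: bounds check (COMMA/ARR_BOUNDS_CHECK_Rng), scaled address
// computation (CAST/LSH/ADD), then the load (IND).
static void* LoadElement(const uint8_t* arr, int32_t index)
{
    int32_t length;
    std::memcpy(&length, arr + 8, sizeof(length)); // ARR_LENGTH lives at +8
    // The unsigned compare also rejects negative indices (jae G_M5106_IG38).
    if (static_cast<uint32_t>(index) >= static_cast<uint32_t>(length))
        throw std::out_of_range("IndexOutOfRangeException");
    // CAST int -> long, LSH by 3, ADD base, ADD 16 (Fseq[#FirstElem]), IND.
    void* element;
    std::memcpy(&element, arr + (static_cast<int64_t>(index) << 3) + 16,
                sizeof(element));
    return element;
}
```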

Inspired by other recent experiments (e.g. #13188), this change
introduces a new node that replaces the above expansion in MinOpts. This
node, `GT_INDEX_ADDR`, represents the bounds check and address
computation involved in an array access, and returns the address of the
element that is to be loaded or stored. Using this node, the example
tree given above expands to the following:

```
[000489] a--XG+------    /--*  IND       ref
[000059] -----+------    |  |  /--*  LCL_VAR   int    V09 loc6
[000060] R--XG+--R---    |  \--*  INDEX_ADDR byref
[000058] -----+------    |     \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG+------    *  ASG       ref
[000061] D----+-N----    \--*  LCL_VAR   ref    V10 loc7
```

This expansion requires only the addition of the `GT_IND` node that
represents the memory access itself. The reduction in IR size translates
to about a 2% decrease in instructions retired during non-optimizing
compilation. Furthermore, this expansion tends to generate smaller
code; for example, the tree given above is generated in 29 rather than
35 bytes:

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```
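The new shape splits the access into an address computation and a separate dereference. As a toy C++ model (not JIT code, same assumed layout as before: length at +8, first element at +16):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// INDEX_ADDR: bounds check plus element address (the cmp/jae/lea above).
static const uint8_t* IndexAddr(const uint8_t* arr, int32_t index)
{
    int32_t length;
    std::memcpy(&length, arr + 8, sizeof(length));
    if (static_cast<uint32_t>(index) >= static_cast<uint32_t>(length))
        throw std::out_of_range("IndexOutOfRangeException");
    return arr + (static_cast<int64_t>(index) << 3) + 16; // lea [rsi+8*rdi+16]
}

// GT_IND: the one extra node, just the final load through the address.
static void* Ind(const uint8_t* addr)
{
    void* value;
    std::memcpy(&value, addr, sizeof(value));
    return value;
}
```

The operands stay in registers between the two steps, which is why no spill temps are needed.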

@pgavlin
Author

pgavlin commented Aug 6, 2017

Here's the throughput data for crossgen System.Private.CoreLib.dll on Windows/x64 as gathered using pin:

| Jit  | PGO | Minopts | Inst Retired | Ratio to resp base |
|------|-----|---------|--------------|--------------------|
| Base | No  | No      | 1.992e10     | --                 |
| Base | No  | Yes     | 1.198e10     | --                 |
| Base | Yes | No      | 1.706e10     | --                 |
| Base | Yes | Yes     | 1.069e10     | --                 |
| Diff | No  | No      | 1.990e10    | 0.999              |
| Diff | No  | Yes     | 1.172e10    | 0.978              |
| Diff | Yes | No      | 1.710e10    | 1.002              |
| Diff | Yes | Yes     | 1.051e10    | 0.983              |

@pgavlin
Author

pgavlin commented Aug 6, 2017

cc @AndyAyersMS @CarolEidt

@pgavlin
Author

pgavlin commented Aug 6, 2017

@dotnet-bot test Windows_NT x64 Throughput

@AndyAyersMS
Member

Very cool.

I think you should do this in debug and rare block cases too, and (soon) in Tier0.

Do you need to carry over the spill logic from the original expansion?

Do you have native size data? Would be interesting to see if the ~1.7% TP win here implies a 2.8% reduction in code size.

It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
;; IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi+8*rdi+16]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```

@pgavlin
Author

pgavlin commented Aug 6, 2017

> I think you should do this in debug and rare block cases too, and (soon) in Tier0.

I thought about this, but opted not to do so in this set of changes in order to avoid teaching the optimizer about the new node. I agree that it is something we ought to look into (it may be easier than I think).

> Do you need to carry over the spill logic from the original expansion?

Nope. The spill logic is only necessary because the original expansion needs to make both the array object and the index multi-use. If they are "cheap" and side-effect free, the original expansion simply clones them; if they are expensive or side-effecting, it dumps them into lclVars and uses the lclVars instead. The new expansion does not need to make these values multi-use; it just needs them to be computed into registers (which will then be multiply-used).
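The clone-versus-spill decision described above can be sketched as follows; this is a toy model with hypothetical types, not the JIT's actual `GenTree` code:

```cpp
#include <cassert>
#include <string>

// Toy model of the multi-use handling the old expansion requires: a cheap,
// side-effect-free operand is simply cloned, while anything else is spilled
// to a fresh temp ("lclVar") and both uses read the temp instead.
struct Operand
{
    std::string label;
    bool cheap;
    bool sideEffects;
};

struct MultiUse
{
    Operand use1;
    Operand use2;
    bool spilled; // did we have to introduce a temp lclVar?
};

static MultiUse MakeMultiUse(const Operand& value, int& nextTempNum)
{
    if (value.cheap && !value.sideEffects)
        return { value, value, false }; // clone the tree: no new lclVar

    // Expensive or side-effecting: ASG tmp = value, then use tmp twice.
    Operand tmp{ "V" + std::to_string(nextTempNum++), true, false };
    return { tmp, tmp, true };
}
```

Under the new expansion, neither path is needed; each operand is evaluated once and its register is reused.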

> Do you have native size data? Would be interesting to see if the ~1.7% TP win here implies a 2.8% reduction in code size.

I do. Here's what jit-diff had to say:

```
x86: Total bytes of diff: -12 (0.00 % of base)
x86 minopts: Total bytes of diff: -77397 (-0.58 % of base)
x64: Total bytes of diff: -24 (0.00 % of base)
x64 minopts: Total bytes of diff: -124736 (-0.74 % of base)
```

Some assemblies do better than others. System.Private.CoreLib.dll, for example, improves by over 1% on both x86 minopts and x64 minopts. The diffs in non-minopts are due to two functions in S.P.CoreLib that end up being compiled without optimization (Task.WaitAll and Task.WaitAny). We might want to investigate these at some point.

> It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.

It would be nice, yes. Getting this would require additional work in lowering in order to determine whether or not the GT_INDEX_ADDR node could be contained by its consumer. Unfortunately, that check is not currently terribly cheap due to some of the side-effect analysis that it performs, and would probably negatively affect throughput.

@AndyAyersMS
Member

Looks like the WaitAll/WaitAny debug codegen is intentional:

```
        [MethodImpl(MethodImplOptions.NoOptimization)]  // this is needed for the parallel debugger
        public static void WaitAll(params Task[] tasks)
```

@stephentoub is this attribute here to ensure this method's frame remains distinct (eg noinline and no tail call?) If so I think marking it as NoInline has the same effect and would give slightly better code.

@AndyAyersMS
Member

> It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.
>
> It would be nice, yes. Getting this would require additional work in lowering in order to determine whether or not the GT_INDEX_ADDR node could be contained by its consumer. Unfortunately, that check is not currently terribly cheap due to some of the side-effect analysis that it performs, and would probably negatively affect throughput.

Don't we end up doing containment analysis anyways, for the lea?

@pgavlin
Author

pgavlin commented Aug 7, 2017

> Don't we end up doing containment analysis anyways, for the lea?

The implementation presented here leaves the `GT_INDEX_ADDR` node intact all the way to the code generator, so lowering never sees an LEA.

@AndyAyersMS
Member

I can understand punting on the rare block case now, but it really seems like you should enable this for debug codegen. While we certainly care about making minopts fast, we care even more about making debug codegen fast.

I looked over the code changes and they look reasonable, but you should get someone else's take on it as well.

@pgavlin
Author

pgavlin commented Aug 7, 2017

> I can understand punting on the rare block case now, but it really seems like you should enable this for debug codegen. While we certainly care about making minopts fast, we care even more about making debug codegen fast.

I looked into this a bit yesterday, and it doesn't seem as tough as I thought it might be. If anything, it helps illuminate a number of routines that really ought to handle the new node but that these changes do not update.

> I looked over the code changes and they look reasonable, but you should get someone else's take on it as well.

Yup. This is just a side project at the moment, so I'm not in any particular rush. I'm hoping that @CarolEidt can take a peek tomorrow once she's back.

@stephentoub
Member

> @stephentoub is this attribute here to ensure this method's frame remains distinct (eg noinline and no tail call?) If so I think marking it as NoInline has the same effect and would give slightly better code.

I believe that's the case, but @gregg-miskelly could comment for sure. There was a recent discussion related to this in some issue, but I can't seem to find it now.

@gregg-miskelly

That code wasn't done by the core debugger team (I think it was done by the PCP debugger team way back in the day), so I can't say with 100% certainty, but the debugger code wants to scrape the arguments (both the this value and the value of tasks being waited on), so I think NoInline wouldn't be enough. Though I think things could be improved by moving the complicated function bodies to a separate method which can be optimized.

@CarolEidt CarolEidt left a comment

LGTM with a few suggestions

```
default:
{
    // Multiply the index by the element size.
    //
```


Perhaps you could open an issue for this.

```
#endif // FEATURE_SIMD

#ifndef LEGACY_BACKEND
// In minopts, we expand GT_INDEX to GT_IND(GT_INDEX_ADDR) in order to save on IR.
```


This could use a somewhat expanded comment about the pros and cons.

src/jit/gtlist.h Outdated
```
GTNODE(COLON , GenTreeColon ,0,GTK_BINOP|GTK_NOTLIR)

GTNODE(INDEX , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP|GTK_NOTLIR) // SZ-array-element
GTNODE(INDEX_ADDR , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP) // addr of SZ-array-element
```


It might be useful to say a bit more here - perhaps "addr of SZ-array-element; used when minimizing JIT compilation time."

```
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
```


Given that there are several cases to consider, and that much of this code is comment, do you think it would be useful to put this in codegencommon.cpp, and perhaps even abstract the target-specific bits to avoid #ifdefs?

@pgavlin pgavlin force-pushed the NoExpandIndex branch 2 times, most recently from 0cefca5 to a3c9517 on August 15, 2017 16:40
@pgavlin
Author

pgavlin commented Aug 15, 2017

@dotnet-bot test Windows_NT x64 minopts

@pgavlin
Author

pgavlin commented Aug 15, 2017

@dotnet-bot test Windows_NT x64 Throughput

@pgavlin
Author

pgavlin commented Aug 21, 2017

@dotnet-bot
test Windows_NT x64 minopts
test Windows_NT x64 Throughput

@pgavlin
Author

pgavlin commented Aug 21, 2017

@AndyAyersMS @CarolEidt @dotnet/jit-contrib PTAL

I've updated this a bit to make it easier to run the smaller expansion through the optimizer in the future. The primary difference is that rather than using `GenTreeIndex` for both `GT_INDEX` and `GT_INDEX_ADDR`, these changes now use a new struct, `GenTreeIndexAddr`, for the latter.

@pgavlin
Author

pgavlin commented Aug 22, 2017

@dotnet-bot test Windows_NT x86 CoreCLR Perf Tests Correctness

```
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
```
Member


Still think this should be `if (opts.MinOpts() || opts.compDbgCode)`.

Author

I agree. I intend to do this with a follow-up change, as it requires a number of changes to various bits of the optimizer.


I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)

Author

Okay, I think I've parsed this--you want to see a comment about why we use a larger expansion when optimizing?


Yes, that's right

Author

Done.

Member

@AndyAyersMS AndyAyersMS left a comment

Just the one nit about having this kick in for debuggable codegen too.

Otherwise looks good.

@CarolEidt CarolEidt left a comment

Looks good overall, with some things I would change.

```
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
```


I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)

```
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
```


I still feel that there is a lot of common code between this and the xarch implementation that could be factored out and hoisted to codegencommon.cpp to avoid duplication.

Author

I disagree: while the flow of the logic is similar, it differs substantially in whether/how the index is widened and what sequence of instructions is used to generate the address calculation. Refactoring this apart would produce something like

```
{
    GenTree* const base  = node->Arr();
    GenTree* const index = node->Index();

    genConsumeReg(base);
    genConsumeReg(index);

    // Generate the bounds check if necessary.
    if ((node->gtFlags & GTF_INX_RNGCHK) != 0)
    {
        generateRangeCheckForIndexAddr();
    }

    generateAddressForIndexAddr();

    genProduceReg(node);
}
```

which I don't think is a substantial improvement.


OK, makes sense.

@CarolEidt CarolEidt left a comment

LGTM. Thanks for expanding the comment.

@pgavlin pgavlin merged commit a7ffdec into dotnet:master Aug 25, 2017
@pgavlin pgavlin deleted the NoExpandIndex branch August 25, 2017 16:18
@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017
briansull added a commit to briansull/coreclr that referenced this pull request Aug 29, 2017
briansull added a commit to briansull/coreclr that referenced this pull request Aug 29, 2017
jkotas added a commit to jkotas/coreclr that referenced this pull request Aug 30, 2017
This reverts commit a7ffdec, reversing
changes made to f5f622d.
jkotas added a commit that referenced this pull request Aug 30, 2017


7 participants