Use a smaller expansion of GT_INDEX in MinOpts #13245
Conversation
Here's the throughput data for
@dotnet-bot test Windows_NT x64 Throughput
Very cool. I think you should do this in debug and rare block cases too, and (soon) in Tier0. Do you need to carry over the spill logic from the original expansion? Do you have native size data? Would be interesting to see if the ~1.7% TP win here implies a 2.8% reduction in code size. It also sure would be nice to fold the full address calculation into the following load/store. Not sure how hard this would be, or (since it happens late) whether it would help TP much, but it might.
I thought about this, but opted not to do so in this set of changes in order to avoid teaching the optimizer about the new node. I agree that it is something we ought to look into (it may be easier than I think).
Nope. The spill logic is only necessary because the original expansion needs to make both the array object and the index multi-use. If they are "cheap" and side-effect free, the original expansion simply clones them; if they are expensive or side-effecting, it dumps them into lclVars and uses the lclVars instead. The new expansion does not need to make these values multi-use, it just needs them to be computed into registers (which will then be multiply-used).
I do. Here's x86: `Total bytes of diff: -12 (0.00 % of base)`. Some assemblies do better than others.
It would be nice, yes. Getting this would require additional work in lowering in order to determine whether or not the
Looks like the WaitAll/WaitAny debug codegen is intentional:

```csharp
[MethodImpl(MethodImplOptions.NoOptimization)] // this is needed for the parallel debugger
public static void WaitAll(params Task[] tasks)
```

@stephentoub is this attribute here to ensure this method's frame remains distinct (e.g. no inline and no tail call)? If so, I think marking it NoInline would have the same effect and would give slightly better code.
Don't we end up doing containment analysis anyway, for the
The implementation presented here leaves the
I can understand punting on the rare block case now, but it really seems like you should enable this for debug codegen. While we certainly care about making minopts fast, we care even more about making debug codegen fast. I looked over the code changes and they look reasonable, but you should get someone else's take on it as well. |
I looked into this a bit yesterday, and it doesn't seem as tough as I thought it might be. If anything, it helps illuminate a number of routines that really ought to be handling the new node that these changes do not update.
Yup. This is just a side project at the moment, so I'm not in any particular rush. I'm hoping that @CarolEidt can take a peek tomorrow once she's back.
I believe that's the case, but @gregg-miskelly could comment for sure. There was a recent discussion related to this in some issue, but I can't seem to find it now.
That code wasn't done by the core debugger team (I think it was done by the PCP debugger team way back in the day), so I can't say with 100% certainty, but the debugger code wants to scrape the arguments (both the
default:
{
    // Multiply the index by the element size.
    //
Perhaps you could open an issue for this.
src/jit/morph.cpp
#endif // FEATURE_SIMD

#ifndef LEGACY_BACKEND
    // In minopts, we expand GT_INDEX to GT_IND(GT_INDEX_ADDR) in order to save on IR.
This could use a somewhat expanded comment about the pros and cons.
src/jit/gtlist.h
GTNODE(COLON      , GenTreeColon ,0,GTK_BINOP|GTK_NOTLIR)

GTNODE(INDEX      , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP|GTK_NOTLIR) // SZ-array-element
GTNODE(INDEX_ADDR , GenTreeIndex ,0,GTK_BINOP|GTK_EXOP)            // addr of SZ-array-element
It might be useful to say a bit more here - perhaps "addr of SZ-array-element; used when minimizing JIT compilation time."
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
Given that there are several cases to consider, and that much of this code is comment, do you think it would be useful to put this in codegencommon.cpp, and perhaps even abstract the target-specific bits to avoid #ifdefs?
Force-pushed 0cefca5 to a3c9517
@dotnet-bot test Windows_NT x64 minopts
@dotnet-bot test Windows_NT x64 Throughput
We currently expand `GT_INDEX` nodes during morph into an explicit bounds check followed by a load. For example, this tree:

```
[000059] ------------          /--*  LCL_VAR   int    V09 loc6
[000060] R--XG-------       /--*  INDEX     ref
[000058] ------------       |  \--*  LCL_VAR   ref    V00 arg0
[000062] -A-XG-------    *  ASG       ref
[000061] D------N----    \--*  LCL_VAR   ref    V10 loc7
```

is expanded into this tree:

```
[000060] R--XG+------       /--* IND ref
[000491] -----+------       | | /--* CNS_INT long 16 Fseq[#FirstElem]
[000492] -----+------       | \--* ADD byref
[000488] -----+-N----       | | /--* CNS_INT long 3
[000489] -----+------       | | /--* LSH long
[000487] -----+------       | | | \--* CAST long <- int
[000484] i----+------       | | | \--* LCL_VAR int V09 loc6
[000490] -----+------       | \--* ADD byref
[000483] -----+------       | \--* LCL_VAR ref V00 arg0
[000493] ---XG+------       /--* COMMA ref
[000486] ---X-+------       | \--* ARR_BOUNDS_CHECK_Rng void
[000059] -----+------       | +--* LCL_VAR int V09 loc6
[000485] ---X-+------       | \--* ARR_LENGTH int
[000058] -----+------       | \--* LCL_VAR ref V00 arg0
[000062] -A-XG+------    * ASG ref
[000061] D----+-N----    \--* LCL_VAR ref V10 loc7
```

Even in this simple case where both the array object and the index are lclVars, this represents a rather large increase in the size of the IR. In the worst case, the JIT introduces an additional lclVar for both the array object and the index, adding several additional nodes to the tree. When optimizing, exposing the structure of the array access may be helpful, as it may allow the compiler to better analyze the program. When we are not optimizing, however, the expansion serves little purpose besides constraining the IR shapes that must be handled by the backend. Due to its need for lclVars in the worst case, this expansion may even bloat the size of the generated code, as all lclVar references are generated as loads/stores from/to the stack when we are not optimizing.
In the case above, the expanded tree generates the following x64 assembly:

```
IN0018: 000092 mov      rdi, gword ptr [V00 rbp-10H]
IN0019: 000096 mov      edi, dword ptr [rdi+8]
IN001a: 000099 cmp      dword ptr [V09 rbp-48H], edi
IN001b: 00009C jae      G_M5106_IG38
IN001c: 0000A2 mov      rdi, gword ptr [V00 rbp-10H]
IN001d: 0000A6 mov      esi, dword ptr [V09 rbp-48H]
IN001e: 0000A9 movsxd   rsi, esi
IN001f: 0000AC mov      rdi, gword ptr [rdi+8*rsi+16]
IN0020: 0000B1 mov      gword ptr [V10 rbp-50H], rdi
```

Inspired by other recent experiments (e.g. dotnet#13188), this change introduces a new node that replaces the above expansion in MinOpts. This node, `GT_INDEX_ADDR`, represents the bounds check and address computation involved in an array access, and returns the address of the element that is to be loaded or stored. Using this node, the example tree given above expands to the following:

```
[000489] a--XG+------       /--* IND ref
[000059] -----+------       | | /--* LCL_VAR int V09 loc6
[000060] R--XG+--R---       | \--* INDEX_ADDR byref
[000058] -----+------       | \--* LCL_VAR ref V00 arg0
[000062] -A-XG+------    * ASG ref
[000061] D----+-N----    \--* LCL_VAR ref V10 loc7
```

This expansion requires only the addition of the `GT_IND` node that represents the memory access itself. This savings in IR size translates to about a 2% decrease in instructions retired during non-optimizing compilation. Furthermore, this expansion tends to generate smaller code; for example, the tree given above is generated in 29 rather than 35 bytes:

```
IN0018: 000092 mov      edi, dword ptr [V09 rbp-48H]
IN0019: 000095 mov      rsi, gword ptr [V00 rbp-10H]
IN001a: 000099 cmp      rdi, qword ptr [rsi+8]
IN001b: 00009D jae      G_M5106_IG38
IN001c: 0000A3 lea      rsi, bword ptr [rsi+8*rdi+16]
IN001d: 0000A8 mov      rdi, gword ptr [rsi]
IN001e: 0000AB mov      gword ptr [V10 rbp-50H], rdi
```
@dotnet-bot
@AndyAyersMS @CarolEidt @dotnet/jit-contrib PTAL I've updated this a bit to make it easier to run the smaller expansion through the optimizer in the future. The primary difference is that rather than using
@dotnet-bot test Windows_NT x86 CoreCLR Perf Tests Correctness
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
Still think this should be `if (opts.MinOpts() || opts.compDbgCode)`.
I agree. I intend to do this with a follow-up change, as it requires a number of changes to various bits of the optimizer.
I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)
Okay, I think I've parsed this--you want to see a comment about why we use a larger expansion when optimizing?
AndyAyersMS
left a comment
Just the one nit about having this kick in for debuggable codegen too.
Otherwise looks good.
CarolEidt
left a comment
Looks good overall, with some things I would change.
// Furthermore, this representation typically saves on code size in minopts w.r.t. the complete expansion
// performed when optimizing, as it does not require LclVar nodes (which are always stack loads/stores in
// minopts).
if (opts.MinOpts())
I would still like to see a statement about the perf costs (i.e. why we don't do this transformation when optimizing)
//
// Arguments:
//    tree - the GT_INDEX_ADDR node
//
I still feel that there is a lot of common code between this and the xarch implementation that could be factored out and hoisted to codegencommon.cpp to avoid duplication.
I disagree: while the flow of the logic is similar, it differs substantially in whether/how the index is widened and what sequence of instructions is used to generate the address calculation. Refactoring this apart would produce something like
```cpp
{
    GenTree* const base  = node->Arr();
    GenTree* const index = node->Index();

    genConsumeReg(base);
    genConsumeReg(index);

    // Generate the bounds check if necessary.
    if ((node->gtFlags & GTF_INX_RNGCHK) != 0)
    {
        generateRangeCheckForIndexAddr();
    }

    generateAddressForIndexAddr();

    genProduceReg(node);
}
```
which I don't think is a substantial improvement.
CarolEidt
left a comment
LGTM. Thanks for expanding the comment.