Use AVX2 in codegen for GT_STORE_BLK #55604
Conversation
…, resolves failing tests.
… of allocating a GPR
…avx2-gt-store-blk
Related: #33665
src/coreclr/jit/codegenxarch.cpp
Outdated
    regSize /= 2;
}
// Copy the remainder by moving the last regSize bytes of the buffer
unsigned remainder = regSize - size;
Is it worth an assert that (remainder <= regSize)?
Sure, that makes sense. I'll add that.
I think the change to support this is kind of small; you can use the same register, just:
- unsigned remainder = regSize - size;
+ if ((regSize > XMM_REGSIZE_BYTES) && (size <= XMM_REGSIZE_BYTES))
+ {
+     regSize = XMM_REGSIZE_BYTES;
+ }
+ unsigned remainder = regSize - size;
Nit: remainder looks confusing to me, because I was expecting size to be the thing called the remainder, but I am not sure what a better name would be.
size really means remainingSize if I understand correctly, so I would at least rename remainder, because the real remainder here is size.
- unsigned remainder = regSize - size;
+ unsigned shiftBack = regSize - size; // so we could use another full move [totalSize - regSize, totalSize).
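To make the shift-back idea concrete, here is a minimal C++ sketch of the strategy under discussion. Everything here is a hypothetical illustration: REG_SIZE stands in for a SIMD register width (e.g. 32 bytes for YMM), memcpy stands in for a single full-width SIMD move, and CopyBlock is not the JIT's actual emitter code.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical width of one SIMD move (32 bytes, as for a YMM register).
constexpr unsigned REG_SIZE = 32;

void CopyBlock(unsigned char* dst, const unsigned char* src, unsigned totalSize)
{
    // The real code steps down to narrower registers for small blocks;
    // this sketch assumes at least one full-width chunk fits.
    assert(totalSize >= REG_SIZE);

    unsigned regSize = REG_SIZE;
    unsigned offset  = 0;
    unsigned size    = totalSize; // remaining bytes to copy

    // Main loop: full-width moves while a whole chunk remains.
    while (size >= regSize)
    {
        std::memcpy(dst + offset, src + offset, regSize);
        offset += regSize;
        size -= regSize;
    }

    if (size > 0)
    {
        // Shift back so one more full-width move covers exactly
        // [totalSize - regSize, totalSize); the overlap with bytes
        // already copied is harmless.
        unsigned shiftBack = regSize - size;
        assert(shiftBack <= regSize);
        offset -= shiftBack;
        std::memcpy(dst + offset, src + offset, regSize);
    }
}
```

For a 33-byte block this performs two 32-byte moves, [0, 32) and then [1, 33), instead of a 32-byte move followed by a chain of narrower tail moves.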
…o depend on AVX, remove vzeroupper inserts.
@tannergooding I've finished updating the PR with your suggestions. Please let me know if there is anything else I can change.
The changes look good/correct to me. CC @dotnet/jit-contrib, community PR from our friends at AMD.
sandreenko
left a comment
Hi @alexcovington, thanks for the PR. The changes look good to me, with one question:
Could you please check if it fixes #33617?
src/coreclr/jit/codegenxarch.cpp
Outdated
unsigned remainder = regSize - size;
assert(remainder <= regSize);
...
srcOffset -= remainder;
Is the idea that, after it, srcOffset + regSize == totalSize?
If we have blockSize = 48, would it be beneficial to copy as [0, 32) YMM + [32, 48) XMM, rather than [0, 32) YMM + [16, 48) YMM?
Could you please check if it fixes #33617?
Sure, I'll take a look and add a separate comment after I check.
is the idea that after it srcOffset + regSize == totalSize?
Yes, that's the idea. If there are any remaining bytes leftover after the main copy loop, back up the offset enough so that we can do a single SIMD move with whatever register was allocated to finish moving the remaining bytes.
would it be beneficial to copy as [0, 32) YMM [32, 48) XMM, not as [0,32) YMM [16, 48) YMM?
I don't expect this to have a perf difference, although I'd have to benchmark it to be sure. It seemed easier/cleaner to just reuse the register and regSize that was setup pre-copy. Let me know if you think it's better to break up cases like this instead.
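For the blockSize = 48 case specifically, the two options compare like this. This is a hypothetical sketch (the function names are invented and memcpy stands in for the single YMM/XMM moves); it only illustrates that both strategies cover all 48 bytes.

```cpp
#include <cstring>

// Reusing the 32-byte register with an overlapping second move,
// as the PR does: the two moves cover [0, 32) and [16, 48).
void CopyReuseYmm(unsigned char* dst, const unsigned char* src)
{
    std::memcpy(dst, src, 32);           // [0, 32)  "YMM" move
    std::memcpy(dst + 16, src + 16, 32); // [16, 48) overlapping "YMM" move
}

// The alternative raised in the question: step down to a 16-byte tail,
// so the moves cover [0, 32) and [32, 48) with no overlap.
void CopyYmmThenXmm(unsigned char* dst, const unsigned char* src)
{
    std::memcpy(dst, src, 32);           // [0, 32)  "YMM" move
    std::memcpy(dst + 32, src + 32, 16); // [32, 48) "XMM" move
}
```

Both variants write every byte in [0, 48); the overlapping form redundantly rewrites [16, 32), which on modern cores is generally cheap unless an access splits a cache line or page.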
@sandreenko Unfortunately, it looks like #33617 is not fixed by this change. I'm seeing what you described: the IL looks the same for both of the Vector inits, but the generated code still uses XMM registers to complete the assignment to the field. Here's the full dump: Full dump
I see. It is because you are changing only runtime/src/coreclr/jit/codegenxarch.cpp, line 2707 in b412aac. Could I ask you to do the same change there if you have time? If not, I am OK with merging this; it is still a nice improvement.
@sandreenko Sure, I'll take a look at
Yeah, the only difference on modern CPUs should be if the read/write crosses a cache-line or page boundary. Using
It occurs to me that generating a zero into a register means the "const zero" node should not be marked contained, so codegen puts it into a register. So maybe it shouldn't be contained, but should be marked as a type that codegen would use to create an xmm or ymm zero. It's a bit tricky here because if we actually needed a zero in the GPR and xmm/ymm, we only have a single "const zero" node.
… remainder instead of step-down
…ft-back for GPR in init instead of step-down register size
@BruceForstall I've updated the PR with your suggested changes. Now the remainder is handled with a shift-back instead of stepping down the register size.

Now for the test case:

[StructLayout(LayoutKind.Sequential, Size = 33)]
private struct Block33 { }

private static Block33 a, b;

static void ReplaceBlock33()
{
    a = new();
    b = a;
}

The following code is produced:

Updated ASM diffs:
- aspnet.run.windows.x64.checked.mch: Detail diffs
- benchmarks.run.windows.x64.checked.mch: Detail diffs
- coreclr_tests.pmi.windows.x64.checked.mch: Detail diffs
- libraries.crossgen2.windows.x64.checked.mch: Detail diffs
- libraries.pmi.windows.x64.checked.mch: Detail diffs

Please let me know if there are any other changes I can make. Thanks!
Not sure why x86 is failing, currently investigating. |
…avx2-gt-store-blk
…avx2-gt-store-blk
…esolve test failures on x86
Resolved x86 failures. Here is the updated ASM diff:

ASM Diffs:
- benchmarks.run.windows.x64.checked.mch: Detail diffs
- coreclr_tests.pmi.windows.x64.checked.mch: Detail diffs
- libraries.crossgen2.windows.x64.checked.mch: Detail diffs
- libraries.pmi.windows.x64.checked.mch: Detail diffs
…avx2-gt-store-blk
BruceForstall
left a comment
LGTM. Thanks for the contribution!
Improvements on alpine 3.12: dotnet/perf-autofiling-issues#1506
It looks like a couple of the linked issues are tracked as regressions. We should investigate further and determine why. |
Modifies the JIT to generate AVX2 instructions for GT_STORE_BLK nodes when AVX2 is available and the compiler allows for it.
Test case I have been using:
Test Source
Results
Please let me know if I can add any other information. Thanks!