[WIP] Enable cast operand containment#12676
Conversation
|
jit-diff summary: Sample diffs:

  movzx r8, byte ptr [rcx+17]
- mov r8d, r8d

  mov edx, dword ptr [rsi+rcx+40]
- mov edx, edx

- movsx rcx, word ptr [rsi+8]
- movsxd rax, ecx
+ movsx rax, word ptr [rsi+8]

- mov ecx, dword ptr [rsi]
- movsxd rcx, ecx
+ movsxd rcx, dword ptr [rsi]

Code gen for the

G_M55886_IG01:
C5F877 vzeroupper
G_M55886_IG02:
4963C1 movsxd rax, r9d
488D4401FC lea rax, [rcx+rax-4]
483BC8 cmp rcx, rax
7750 ja SHORT G_M55886_IG04
G_M55886_IG03:
440FB609 movzx r9, byte ptr [rcx]
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A1102 vmovss dword ptr [rdx], xmm0
440FB64901 movzx r9, byte ptr [rcx+1]
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A114204 vmovss dword ptr [rdx+4], xmm0
440FB64902 movzx r9, byte ptr [rcx+2]
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A114208 vmovss dword ptr [rdx+8], xmm0
440FB64903 movzx r9, byte ptr [rcx+3]
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A11420C vmovss dword ptr [rdx+12], xmm0
4883C104 add rcx, 4
4883C210 add rdx, 16
483BC8 cmp rcx, rax
76B1 jbe SHORT G_M55886_IG03
G_M55886_IG04:
C3 ret
; Total bytes of code 96, prolog size 3 for method Program:Test(long,long,long,int) |
|
It's worth pointing out that casts involving lclvars have a similar issue: static ulong Test(byte x) => x generates

0FB6C1 movzx rax, cl ; eax actually but it doesn't really matter
8BC0 mov eax, eax

but the IR causing this is different, so a separate solution/fix will be needed: |
|
@CarolEidt Is it correct for a cast with overflow to store its result in the target register before checking for overflow? I assume that it is, because in case of an exception the result can only be observed if we're in a try block and the destination is a lclvar. But if we're in a try block the lclvar won't be enregistered, so the result of the cast won't be stored in the lclvar by the cast node itself but by a subsequent |
I believe that it would only be incorrect, as you suggest, if the register is the home of the lclVar, and is live outside of the try block. But, as you say, that won't be the case in a try. Even when we enable write-thru of EH variables, as is currently being worked on, the memory location is what will be the definitive copy. If/when we consider keeping these lclVars in registers across exception edges, this would be a concern - but I don't see that happening in the near term. |
|
What is status of this PR? What are the next steps to move it forward? |
|
Waiting for #13501 to be sorted out |
Now that #13501 is addressed, we should look at this for 2.1 |
|
@mikedn what about this one, Is it ready for review? |
|
@sandreenko Ha ha, as much as I would love for this particular one to be ready it still isn't. The ARM version is busted. For one thing, ARM64's emitter unexpectedly emits |
|
The ARM story is pretty hairy:
I'll see what I can do about all this. The last one is pretty dubious to fix, but it's probably a good idea to fix it rather than attempt to work around it. Having |
|
On hold again until #20126 is done. |
This is only useful in code shared between ARMARCH and XARCH, otherwise it's simpler to just use the required instructions.
genCodeForArrIndex and genCodeForArrOffset emit an ldrsw on ARM64 but that's not necessary. Array lengths are always positive, so just use ldr on both ARM32 and ARM64.
genCodeForIndexAddr's range check generation is messed up:
- It uses ldrsw to load the array length on ARM64. Not needed, the length is positive.
- It uses sxtw to sign-extend the array length if the index is native int. But it was already extended by the load.
- It creates IND and LEA nodes out of thin air. Just call the appropriate emitter functions.
- It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus.
The scaling of immediate operands is a property of the instruction and its encoding. It doesn't make sense to throw the instruction size (emitAttr) into the mix, that's a codegen/emitter specific concept. On ARM32 it's also almost meaningless, at least in the case of integer types - all instructions are really EA_4BYTE, even ldrb, ldrh etc. The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size because on ARM64 the size is used to select between 32 bit and 64 bit instructions, so it's never EA_1BYTE/EA_2BYTE.
ARM64's ins_Load returns INS_ldrsw for TYP_INT but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed; it may be unsigned, and indir nodes will never have type TYP_UINT nor have the GTF_UNSIGNED flag set. XARCH uses a mov (rather than movsxd, the equivalent of ldrsw) so it zero extends. There's no reason for ARM64 to behave differently, and doing so makes it more difficult to share codegen logic between XARCH and ARM64. Other ARM64 compilers also use ldr rather than ldrsw. This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires.
In particular, clean up the selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case - and it gets it wrong in the process: it required "targetType" to be TYP_UINT or TYP_LONG to use ldar, but there are no TYP_UINT indirs. Also rename "targetType" to "type"; using "target" is misleading. The real target type would be genActualType(tree->TypeGet()).
- Require EA_4BYTE for byte/halfword instructions on ARM32.
- Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, which provides the correct size on both ARM32 and ARM64.
- Replace emitTypeSize with emitActualTypeSize as needed.
|
What's the plan for this PR @RussKeldorph? Maybe we should close it until it is ready - it's almost 2 years old now, with the last changes from 7 months ago ... |
|
OK, I'm going to close this until #20126 is merged. I should probably have ignored ARM and finished this for x86/x64, but I wasn't expecting it to take so long. Oh well, it's too late for 3.0 anyway.
This improves code generation for (u)int -> (u)long casts without overflow when the cast operand is a GT_IND node. Currently this is a two-step process: first load the value from memory (widening it to TYP_INT if it's a small int value), then widen to TYP_LONG. This can be avoided by containing the cast operand.
Fixes #12595