This repository was archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Enable cast operand containment #12676

Closed
mikedn wants to merge 9 commits into dotnet:master from mikedn:long-cast

Conversation

@mikedn commented Jul 7, 2017

This improves code generation for (u)int -> (u)long casts without overflow when the cast operand is a GT_IND node. Currently this is a two-step process: first the value is loaded from memory (and widened to TYP_INT if it is a small int type), and then it is widened to TYP_LONG. The extra widening step can be avoided by containing the cast operand.

Fixes #12595
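The pattern in question can be sketched in C (illustrative only; the function names are made up and this is not JIT code):

```c
#include <stdint.h>

/* Sketch of the pattern this change targets. On x64 a plain 32-bit mov
 * already zero-extends the destination register, and movsxd can load and
 * sign-extend in one instruction, so once the indir is contained in the
 * cast no separate widening instruction is needed. */
uint64_t widen_load_u32(const uint32_t *p)
{
    return *p; /* ideally a single: mov eax, dword ptr [rcx] */
}

int64_t widen_load_i32(const int32_t *p)
{
    return *p; /* ideally a single: movsxd rax, dword ptr [rcx] */
}
```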

@mikedn commented Jul 8, 2017

jit-diff summary:

Total bytes of diff: -10079 (-0.05% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -2377 : System.Private.CoreLib.dasm (-0.06% of base)
       -1220 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1052 : System.Private.Xml.dasm (-0.03% of base)
        -851 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.04% of base)
        -840 : System.Reflection.Metadata.dasm (-0.58% of base)
78 total files with size differences (78 improved, 0 regressed), 51 unchanged.
Top method regressions by size (bytes):
          60 ( 2.97% of base) : System.Private.Xml.dasm - ReflectionXmlSerializationWriter:WriteStructMethod(ref,ref,ref,ref,bool,bool):this
           3 ( 0.37% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LocalRewriter:MakeArguments(ref,struct,ref,ref,bool,struct,byref,byref,bool,ubyte):struct:this
           2 ( 7.14% of base) : System.Private.CoreLib.dasm - UIntPtr:ToUInt32():int:this
           2 ( 7.41% of base) : System.Private.CoreLib.dasm - UIntPtr:op_Explicit(long):int
Top method improvements by size (bytes):
        -103 (-0.78% of base) : System.Reflection.Metadata.dasm - MetadataReader:InitializeTableReaders(struct,ubyte,ref,ref):this
        -101 (-0.96% of base) : System.Private.CoreLib.dasm - DefaultBinder:BindToMethod(int,ref,byref,ref,ref,ref,byref):ref:this
         -82 (-3.21% of base) : Microsoft.VisualBasic.dasm - OverloadResolution:MoreSpecificProcedure(ref,ref,ref,ref,int,byref,bool):ref
         -64 (-6.24% of base) : Microsoft.DotNet.ProjectModel.dasm - CommonCompilerOptions:Equals(ref):bool:this
         -63 (-2.89% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LocalRewriter:MakeConversion(ref,ref,ref,ubyte,ref,bool,bool,bool,bool,ref,ref):ref:this
Top method regressions by size (percentage):
           2 ( 7.41% of base) : System.Private.CoreLib.dasm - UIntPtr:op_Explicit(long):int
           2 ( 7.14% of base) : System.Private.CoreLib.dasm - UIntPtr:ToUInt32():int:this
          60 ( 2.97% of base) : System.Private.Xml.dasm - ReflectionXmlSerializationWriter:WriteStructMethod(ref,ref,ref,ref,bool,bool):this
           3 ( 0.37% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LocalRewriter:MakeArguments(ref,struct,ref,ref,bool,struct,byref,byref,bool,ubyte):struct:this
Top method improvements by size (percentage):
          -2 (-18.18% of base) : System.Net.HttpListener.dasm - HttpListenerTimeoutManager:get_MinSendBytesPerSecond():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_MessagesSent():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_MessagesReceived():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_ErrorsSent():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_ErrorsReceived():long:this
2220 total methods with size differences (2216 improved, 4 regressed), 110665 unchanged.

Sample diffs:

        movzx    r8, byte  ptr [rcx+17]
-       mov      r8d, r8d

        mov      edx, dword ptr [rsi+rcx+40]
-       mov      edx, edx

-       movsx    rcx, word  ptr [rsi+8]
-       movsxd   rax, ecx
+       movsx    rax, word  ptr [rsi+8]

-       mov      ecx, dword ptr [rsi]
-       movsxd   rcx, ecx
+       movsxd   rcx, dword ptr [rsi]

Code gen for the LUTMap function from the associated issue:

G_M55886_IG01:
       C5F877               vzeroupper
G_M55886_IG02:
       4963C1               movsxd   rax, r9d
       488D4401FC           lea      rax, [rcx+rax-4]
       483BC8               cmp      rcx, rax
       7750                 ja       SHORT G_M55886_IG04
G_M55886_IG03:
       440FB609             movzx    r9, byte  ptr [rcx]
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A1102           vmovss   dword ptr [rdx], xmm0
       440FB64901           movzx    r9, byte  ptr [rcx+1]
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114204         vmovss   dword ptr [rdx+4], xmm0
       440FB64902           movzx    r9, byte  ptr [rcx+2]
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114208         vmovss   dword ptr [rdx+8], xmm0
       440FB64903           movzx    r9, byte  ptr [rcx+3]
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A11420C         vmovss   dword ptr [rdx+12], xmm0
       4883C104             add      rcx, 4
       4883C210             add      rdx, 16
       483BC8               cmp      rcx, rax
       76B1                 jbe      SHORT G_M55886_IG03
G_M55886_IG04:
       C3                   ret

; Total bytes of code 96, prolog size 3 for method Program:Test(long,long,long,int)

@mikedn commented Jul 8, 2017

It's worth pointing out that casts involving lclvars have a similar issue:

static ulong Test(byte x) => x;

generates

0FB6C1               movzx    rax, cl ; eax actually but it doesn't really matter
8BC0                 mov      eax, eax

but the IR that causes this is different, so a separate fix will be needed:

N005 (  2,  2) [000001] ------------ t1 =    LCL_VAR   int    V00 arg0         u:2 rcx (last use) REG rcx $80
                                                 /--*  t1     int
N007 (  3,  4) [000005] ------------ t5 = *  CAST      int <- ubyte <- int REG rax $100
                                                 /--*  t5     int
N009 (  4,  6) [000002] ---------U-- t2 = *  CAST      long <- ulong <- uint REG rax $140
                                                 /--*  t2     long
N011 (  5,  7) [000003] ------------ *  RETURN    long   REG NA $180
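For reference, a C analog of the lclvar case above (hypothetical helper name, for illustration only):

```c
#include <stdint.h>

/* C analog of the C# example above: widening a byte parameter to ulong.
 * movzx already zero-extends the full 64-bit register, so the extra
 * "mov eax, eax" currently emitted afterwards is redundant. */
uint64_t widen_byte(uint8_t x)
{
    return x; /* ideally just: movzx eax, cl */
}
```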

@mikedn commented Jul 12, 2017

@CarolEidt Is it correct for a cast with overflow to store its result in the target register before checking for overflow?

I assume that it is, because in the case of an exception the result can only be observed if we're in a try block and the destination is a lclvar. But if we're in a try block the lclvar won't be enregistered, so the result of the cast won't be stored in the lclvar by the cast node itself but by a subsequent GT_STORE_LCL_VAR.
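The ordering being asked about, sketched as a hypothetical C helper (not the JIT's actual code): the narrowed result is produced first, and only afterwards is the overflow condition checked.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch of an overflow-checked long -> int cast with the
 * ordering described above: the result is written first ("to the target
 * register"), and the overflow check runs after (where the JIT would
 * raise an exception). Truncation of out-of-range values wraps on
 * typical two's-complement targets. */
int32_t checked_narrow(int64_t value, bool *overflowed)
{
    int32_t result = (int32_t)value;          /* result produced first */
    *overflowed = ((int64_t)result != value); /* check happens after   */
    return result;
}
```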

@CarolEidt

> Is it correct for a cast with overflow to store its result in the target register before checking for overflow?

I believe that it would only be incorrect, as you suggest, if the register is the home of the lclVar and is live outside of the try block. But, as you say, that won't be the case in a try. Even when we enable write-thru of EH variables, as is currently being worked on, the memory location will be the definitive copy. If/when we consider keeping these lclVars in registers across exception edges, this would be a concern - but I don't see that happening in the near term.

@mikedn force-pushed the long-cast branch 2 times, most recently from fb40135 to edbec97 (July 13, 2017)
@mikedn changed the title from "[WIP] Improve int->long cast codegen" to "[WIP] Enable cast operand containment" (Jul 25, 2017)
@mikedn force-pushed the long-cast branch 3 times, most recently from 8f3c0e4 to 9ee914e (August 20, 2017)
@mikedn force-pushed the long-cast branch 2 times, most recently from 4e6ede9 to 8f44ae6 (August 26, 2017)
@karelz (Member) commented Aug 30, 2017

What is the status of this PR? What are the next steps to move it forward?

@mikedn commented Aug 30, 2017

Waiting for #13501 to be sorted out

@karelz added the "blocked (Issue/PR is blocked on something - see comments)" label (Aug 30, 2017)
@CarolEidt

> Waiting for #13501 to be sorted out

Now that #13501 is addressed, we should look at this for 2.1

@CarolEidt self-requested a review (April 12, 2018)
@mikedn force-pushed the long-cast branch 3 times, most recently from b670617 to 0c5ac0c (August 17, 2018)
@mikedn force-pushed the long-cast branch 2 times, most recently from b302389 to 3636ab6 (September 12, 2018)
@sandreenko

@mikedn what about this one, is it ready for review?

@mikedn commented Sep 14, 2018

@sandreenko Ha ha, as much as I would love for this particular one to be ready it still isn't. The ARM version is busted. For one thing, ARM64's emitter unexpectedly emits ldrsw instead of ldr. And then there appears to be some other issue that leads to asserts during crossgen.

@mikedn commented Sep 14, 2018

The ARM story is pretty hairy:

  • ARM32 asserts when attempting to emit ldrb/ldrsb/ldrh/ldrsh with EA_4BYTE. IMO the assert is bogus, the instruction size is really 4 bytes, ARM does not have 1/2 byte instructions like xarch has.
  • Lack of support for reg-optional on ARM: the relevant code in genConsumeRegs is ifdefed out, even though SetRegOptional is available.
  • OptimizeConstCompare removes a cast node and does not call ClearContained on the cast operand. We end up with a compare instruction with a contained operand, inherited from the cast. Naturally, that fails on ARM.
  • Broken register requirements on casts with contained operand. The contained indir may require a temporary register to deal with unsupported address modes. BuildCast doesn't know that.
  • emitInsLoadStoreOp emits a 64 bit ldr when asked to emit a 32 bit ldr. There seems to be some confusion throughout ARM64 codegen - for TYP_INT indirs it emits sxtw instead of a 32 bit ldr. This deviates from X64 codegen and other ARM64 compilers.

I'll see what I can do about all this.

The last one is pretty dubious to fix. But it's probably a good idea to fix it rather than attempt to work around it. Having emitInsLoadStoreOp unexpectedly emit a 64 bit ldr is asking for trouble.

@mikedn commented Sep 25, 2018

On hold again until #20126 is done.

This is only useful in code shared between ARMARCH and XARCH, otherwise it's simpler to just use the required instructions.
genCodeForArrIndex and genCodeForArrOffset emit an ldrsw on ARM64 but that's not necessary. Array lengths are always positive, so just use ldr on both ARM32 and ARM64.
genCodeForIndexAddr's range check generation is messed up:
- It uses ldrsw to load the array length on ARM64. Not needed, the length is positive.
- It uses sxtw to sign extend the array length if the index is native int. But it was already extended by the load.
- It creates IND and LEA nodes out of thin air. Just call the appropriate emitter functions.
- It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus.
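The positivity argument above can be sketched in C (illustrative, hypothetical names): for any non-negative 32-bit value, sign extension and zero extension agree on the 64-bit result, so the plain zero-extending load suffices for array lengths.

```c
#include <stdint.h>

/* For non-negative 32-bit values - such as array lengths - sign
 * extension (ldrsw/movsxd) and zero extension (ldr/mov) produce the
 * same 64-bit result, so the zero-extending load is always safe here. */
uint64_t zero_extend32(uint32_t v) { return v; }
int64_t  sign_extend32(int32_t v)  { return v; }
```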
The scaling of immediate operands is a property of the instruction and its encoding. It doesn't make sense to throw the instruction size (emitAttr) into the mix; that's a codegen/emitter-specific concept. On ARM32 it's also almost meaningless, at least for integer types - all instructions are really EA_4BYTE, even ldrb, ldrh, etc.

The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size because on ARM64 the size is used to select between 32 bit and 64 bit instructions, so it's never EA_1BYTE/EA_2BYTE.
ARM64's ins_Load returns INS_ldrsw for TYP_INT, but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed; it may be unsigned, and indir nodes never have type TYP_UINT nor have the GTF_UNSIGNED flag set.

XARCH uses a mov (rather than movsxd, the equivalent of ldrsw), so it zero extends. There's no reason for ARM64 to behave differently, and doing so makes it more difficult to share codegen logic between XARCH and ARM64.

Other ARM64 compilers also use ldr rather than ldrsw.

This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires.
In particular, clean up selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case. And it gets it wrong in the process: it required that "targetType" be TYP_UINT or TYP_LONG to use ldar, but there are no TYP_UINT indirs.

Also rename "targetType" to "type"; using "target" is misleading. The real target type would be genActualType(tree->TypeGet()).
- Require EA_4BYTE for byte/halfword instructions on ARM32.
- Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, which provides the correct size on both ARM32 and ARM64.
- Replace emitTypeSize with emitActualTypeSize as needed.
@karelz (Member) commented May 4, 2019

What's the plan for this PR, @RussKeldorph? Maybe we should close it until it is ready - it's almost 2 years old now, with the last changes from 7 months ago ...

@mikedn commented May 4, 2019

OK, I'm going to close this until #20126 is merged. I should have probably ignored ARM and finished this for x86/64, but I wasn't expecting that it would take so long. Oh well, it's too late for 3.0 anyway.

Labels

area-CodeGen; blocked (Issue/PR is blocked on something - see comments)

Development

Successfully merging this pull request may close these issues:

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset