[Arm64] Enable cpblk loop unrolling by sdmaclea · Pull Request #10776 · dotnet/coreclr

sdmaclea · 2017-04-06T21:31:36Z

Fixes #10623

sdmaclea · 2017-04-06T21:31:58Z

@dotnet/arm64-contrib PTAL

BruceForstall · 2017-04-06T21:43:12Z

cc @dotnet/jit-contrib

briansull · 2017-04-06T22:29:48Z

src/jit/codegenarm64.cpp


    // Fill the remainder (15 bytes or less) if there's one.
-    if ((size & 0xf) != 0)
+    if ((size & 0x7) != 0)


Unless you implement the larger loads/stores, the comment is now wrong, since this handles a remainder of 7 bytes or less.
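To make the arithmetic behind this remark concrete, here is a minimal sketch (not the JIT's actual code). With 8-byte slots, `size & 0x7` is the number of leftover bytes, which is at most 7; with 16-byte slots, `size & 0xf` is at most 15. Either tail can be covered by one copy per set bit. The helper name `remainderChunks` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: lists the copy sizes needed for the tail left over
// after copying `size` bytes in whole `slotSize`-byte slots.
std::vector<std::size_t> remainderChunks(std::size_t size, std::size_t slotSize)
{
    // slotSize must be a power of two (8 for X registers, 16 for Q registers).
    assert(slotSize != 0 && (slotSize & (slotSize - 1)) == 0);

    std::vector<std::size_t> chunks;
    std::size_t remainder = size & (slotSize - 1); // e.g. size & 0x7 for 8-byte slots

    // Each set bit of the remainder corresponds to one power-of-two copy,
    // emitted from largest to smallest.
    for (std::size_t chunk = slotSize / 2; chunk >= 1; chunk /= 2)
    {
        if ((remainder & chunk) != 0)
        {
            chunks.push_back(chunk);
        }
    }
    return chunks;
}
```

For a 23-byte block, 8-byte slots leave a 7-byte tail (4 + 2 + 1), which is why the "15 bytes or less" wording only fits a 16-byte slot size.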

briansull · 2017-04-06T22:37:21Z

src/jit/codegenarm64.cpp

-        assert(genIsValidFloatReg(xmmReg));
-        size_t slots = size / XMM_REGSIZE_BYTES;
+        // TODO-ARM64-CQ: Consider using LDP/STP to save codesize.
+        size_t slots = size / REGSIZE_BYTES;


It seems like Arm64 should copy 16 bytes at a time.
If you grab an FP/vector register it would be the same as x64:

LDR Q0, [addr]
STR Q0, [addr]

@briansull I was deferring doing 16 bytes at a time.

I was just trying to get this to work first, trying to walk before I run.

ldp/stp with integer registers might be slightly faster than ldr/str Q, but I needed to figure out how to allocate a pair of adjacent registers if I wanted to use ldp/stp (if I remember the instruction encoding form correctly).

I needed to look at the armv8 spec for alignment requirements.

I was concerned about using FP registers and its impact on Linux kernel context switch. This is probably not a real issue.

So I just looked at the ARMv8 spec. Per B2.4.2, "Alignment of data accesses", ldr V0 would require a 16-byte-aligned address. VLD1 V0.16B would have no alignment requirements.

Is there any cpBlk alignment guarantee? I assume it is guaranteed to be 8 byte aligned.

First, the LDP/STP instructions do not require adjacent register numbers; any two registers can be used. So you can use two integer registers if you want to avoid using FP.

I don't think that there is an alignment requirement except that the actual SP register must be aligned. (if n == 31 then CheckSPAlignment();)

My manual for B2.4.2 reads:

The alignment requirements for accesses to Normal memory are as follows:
• For all instructions that load or store a single or multiple registers, other than Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release, if the address that is accessed is not aligned to the size of the data element being accessed, then one of the following occurs:
— An Alignment fault is generated.
— An unaligned access is performed.
SCTLR_ELx.A at the current Exception level can be configured to enable an alignment check, and thereby determine which of these two options is used.

• For all Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release memory accesses that access a single element or a pair of elements, an Alignment fault is generated if the address being accessed is not aligned to the size of the data structure being accessed

Key phrase is "one of the following occurs: — An Alignment fault is generated." If SCTLR.A is set an alignment fault will fire. It would be safer to generate code which does not assume SCTLR.A == 0.
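The quoted rule can be restated as a small predicate. This is only a model of the manual text above for ordinary (non-exclusive, non-acquire/release) accesses to Normal memory, not anything the JIT or hardware exposes. The names `isAligned` and `mayFault` are hypothetical, and `alignmentCheckEnabled` stands in for SCTLR_ELx.A.

```cpp
#include <cassert>
#include <cstdint>

// True if `address` is a multiple of `elementSize`.
bool isAligned(std::uint64_t address, std::uint64_t elementSize)
{
    // elementSize is assumed to be a power of two (1, 2, 4, 8, 16, ...).
    assert(elementSize != 0 && (elementSize & (elementSize - 1)) == 0);
    return (address & (elementSize - 1)) == 0;
}

// Model of the quoted rule: a misaligned access raises an Alignment fault
// only when the alignment check is enabled; otherwise it is performed as an
// unaligned access.
bool mayFault(std::uint64_t address, std::uint64_t elementSize, bool alignmentCheckEnabled)
{
    return alignmentCheckEnabled && !isAligned(address, elementSize);
}
```

Under this model, an 8-byte-aligned address such as 0x1008 is safe for 8-byte elements whatever the setting, while a 16-byte element at the same address is safe only when the check is disabled. Code that sticks to element sizes no larger than the guaranteed alignment does not depend on SCTLR.A == 0.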

Thanks for the LDP tip. I hadn't gotten a chance to recheck that yet.

Floating point offers code size advantages. For instance, VLD1 V0, V1, V2, V3, [Xn] could load the whole block in a single instruction, but the adjacency requirements make me avoid trying this. I assume lsra does not already support requesting adjacent registers.

Yeah, I don't think that lsra has any support (yet) for adjacent register requirements.
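As a rough illustration of the unrolled shape discussed in this thread (a sketch, not the code in this PR), the following builds pseudo-assembly for copying a block 16 bytes at a time with LDP/STP through two integer temporaries, followed by a single 8-byte LDR/STR if one slot remains. The function `planCopy`, the register choices (x0 destination, x1 source, x9 and x14 temporaries), the size cap, and the textual output are all assumptions made for illustration. Tails shorter than 8 bytes are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: returns pseudo-assembly for an unrolled block copy.
// x9 and x14 are deliberately non-adjacent, since LDP/STP accept any two
// registers.
std::vector<std::string> planCopy(std::size_t size)
{
    // Arbitrary cap for this sketch so the immediate offsets stay small.
    assert(size <= 256);

    std::vector<std::string> code;
    std::size_t offset = 0;

    // 16 bytes per step: one load pair followed by one store pair.
    for (; size - offset >= 16; offset += 16)
    {
        const std::string at = "#" + std::to_string(offset) + "]";
        code.push_back("ldp x9, x14, [x1, " + at);
        code.push_back("stp x9, x14, [x0, " + at);
    }

    // One remaining 8-byte slot, if any.
    if (size - offset >= 8)
    {
        const std::string at = "#" + std::to_string(offset) + "]";
        code.push_back("ldr x9, [x1, " + at);
        code.push_back("str x9, [x0, " + at);
    }
    return code;
}
```

For a 24-byte block this yields one LDP/STP pair at offset 0 and one LDR/STR pair at offset 16, so four instructions instead of the six that 8-byte LDR/STR alone would need.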

briansull · 2017-04-06T22:39:37Z

src/jit/codegenarm64.cpp

-    // loads and stores.
-    if (size >= XMM_REGSIZE_BYTES)
+    // Grab the integer temp register to emit the loads and stores.
+    regNumber tmpReg = genRegNumFromMask(cpBlkNode->gtRsvdRegs & RBM_ALLINT);


You would also have to teach lower to reserve both a fp/vector register and an integer register.
But x64 also must have that logic already.

If we switch to a Q reg, we should be able to use just the one FP register.

Well, if you are copying a size that is not a multiple of 16, you use the FP register to copy the multiple-of-16 portion and the integer register to copy the rest.

Yes, but you can do the rest with the same FP register on arm64. You just use the single-lane forms of ST1, e.g. ST1 { Vt.B }[0], [Xn].

Yeah, I hadn't considered using those instructions. The ARM64 encoder does support them. I remember implementing them.

sdmaclea · 2017-04-07T14:19:03Z

src/jit/codegenarm64.cpp

-    NYI("genCodeForStoreOffset");
-#endif // !0
+    // For arm64 these functions are identical
+    genCodeForLoadOffset(ins, size, src, base, offset);


This is inconsistent with the design. It should use emitIns_S_R() for the store.

sdmaclea · 2017-04-07T17:38:46Z

@briansull I just added ldp/stp support to this PR. PTAL

briansull · 2017-04-07T18:27:33Z

src/jit/codegenarm64.cpp

+            offset += base->gtLclFld.gtLclOffs;
+
+        // TODO-ARM64-CQ: Implement support for using a ldp instruction with a varNum (see emitIns_R_S)
+        emit->emitIns_R_S(INS_ldr, EA_8BYTE, dst, base->gtLclVarCommon.gtLclNum, offset);


Yeah, that part isn't implemented yet.
The problem is that there isn't encoding space available to encode both registers in the instruction, because one register in the encoding is reserved for something else (larger SP offsets).

briansull · 2017-04-07T18:28:59Z

Looks Good

danmoseley · 2017-04-08T19:21:52Z

@dotnet-bot Test OSX10.12 x64 Checked Build and Test (build break fix)

[Arm64] Enable cpblk loop unrolling

e021cad

dnfclas added the cla-already-signed label Apr 6, 2017

Fix formating

2255bb9

briansull reviewed Apr 6, 2017

sdmaclea commented Apr 7, 2017

sdmaclea added 2 commits April 7, 2017 16:07

Address review feedback

d5d1415

[Arm64] Use ldp/stp in CpBlkUnroll

97bac15

briansull reviewed Apr 7, 2017

BruceForstall merged commit 2a6fb30 into dotnet:master Apr 9, 2017

sdmaclea deleted the PR-ARM64-CpBlkUnroll branch April 10, 2017 15:57

karelz modified the milestone: 2.0.0 Aug 28, 2017
