Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

[Arm64] Enable cpblk loop unrolling#10776

Merged
BruceForstall merged 4 commits intodotnet:masterfrom
sdmaclea:PR-ARM64-CpBlkUnroll
Apr 9, 2017
Merged

[Arm64] Enable cpblk loop unrolling#10776
BruceForstall merged 4 commits intodotnet:masterfrom
sdmaclea:PR-ARM64-CpBlkUnroll

Conversation

@sdmaclea
Copy link

@sdmaclea sdmaclea commented Apr 6, 2017

Fixes #10623

@sdmaclea
Copy link
Author

sdmaclea commented Apr 6, 2017

@dotnet/arm64-contrib PTAL

@BruceForstall
Copy link

cc @dotnet/jit-contrib


// Fill the remainder (15 bytes or less) if there's one.
if ((size & 0xf) != 0)
if ((size & 0x7) != 0)
Copy link

@briansull briansull Apr 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you implement the larger load/stores, the comment is wrong now as this does (7 bytes or less)

assert(genIsValidFloatReg(xmmReg));
size_t slots = size / XMM_REGSIZE_BYTES;
// TODO-ARM64-CQ: Consider using LDP/STP to save codesize.
size_t slots = size / REGSIZE_BYTES;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like Arm64 should do 16-bytes at a time.
If you grab a FP/Vector register it would be the same as X64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LDR Q0, [addr]
STR Q0, [addr]

Copy link
Author

@sdmaclea sdmaclea Apr 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@briansull I was deferring doing 16 bytes at a time.

  • I was just trying to get this to work. Trying to walk before run.
  • ldp/stp with integer register might be slightly faster than ldr/str Q, but I needed to figure out how to allocate a pair of adjacent registers if I wanted to use ldp/stp (if I remember the instruction encoding form correctly).
  • I needed to look at the armv8 spec for alignment requirements.
  • I was concerned about using FP registers and its impact on Linux kernel context switch. This is probably not a real issue.

So I just looked at the armv8 spec. Per B2.4.2 Alignment of data accesses -- ldr V0 would require an 16 byte aligned address. VLD1 V0.16B would have no alignment requirements.

Is there any cpBlk alignment guarantee? I assume it is guaranteed to be 8 byte aligned.

Copy link

@briansull briansull Apr 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First the LDP/STP instructions do not require adjacent register number, any two registers can be used. So you can use two integer registers if you want to avoid using FP.

I don't think that there is an alignment requirement except that the actual SP register must be aligned. (if n == 31 thenCheckSPAlignment();)

My manual for B2.4.2 reads:

The alignment requirements for accesses to Normal memory are as follows:
For all instructions that load or store a single or multiple registers, other than
Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release, if the address that is accessed is not aligned to the size of the data element being accessed, then one of the following occurs:
— An Alignment fault is generated.
— An unaligned access is performed.
SCTLR_ELx.A at the current Exception level can be configured to enable an alignment check, and thereby determine which of these two options is used.

• For all Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release memory accesses that access a single element or a pair of elements, an Alignment fault is generated if the address being accessed is not aligned to the size of the data structure being accessed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key phrase is "one of the following occurs: — An Alignment fault is generated." If SCTLR.A is set an alignment fault will fire. It would be safer to generate code which does not assume SCTLR.A == 0.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the LDP tip. I hadn't gotten a chance to recheck that yet.

Floating point offers code size advantages. For instance, VLD1 V0, V1, V2, V3, [Xn] could load the whole block in a single instruction, but the adjacency requirements makes me avoid trying this. I assume lrsa does not already support requesting adjacent registers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think that lsra has any support (yet) for adjacent register requirements.

// loads and stores.
if (size >= XMM_REGSIZE_BYTES)
// Grab the integer temp register to emit the loads and stores.
regNumber tmpReg = genRegNumFromMask(cpBlkNode->gtRsvdRegs & RBM_ALLINT);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would also have to teach lower to reserve both a fp/vector register and an integer register.
But x64 also must have that logic already.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we switch to Q reg we should be able to just use one the FP register.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if you are copying an odd number of bytes you use the FP register to do the multiple of 16 amount and the integer register to do the rest.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can do the rest with the same FP register on arm64. You just use the single lane forms of the ST1 Vt.B[0], [ Xn ]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I hadn't consider using those instructions. The ARM64 encoder does support them. I remember implementing them.

NYI("genCodeForStoreOffset");
#endif // !0
// For arm64 these functions are identical
genCodeForLoadOffset(ins, size, src, base, offset);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent with the design. Should use emitIns_S_R() for the store

@sdmaclea
Copy link
Author

sdmaclea commented Apr 7, 2017

@briansull I just added ldp/stp support to this PR. PTAL

offset += base->gtLclFld.gtLclOffs;

// TODO-ARM64-CQ: Implement support for using a ldp instruction with a varNum (see emitIns_R_S)
emit->emitIns_R_S(INS_ldr, EA_8BYTE, dst, base->gtLclVarCommon.gtLclNum, offset);
Copy link

@briansull briansull Apr 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, That part isn't implemented yet.
The problem is that there isn't encoding space available to encode both registers in the instruction, because one register in the encoding is reserved for something else (larger SP offsets)

@briansull
Copy link

Looks Good

@danmoseley
Copy link
Member

@dotnet-bot Test OSX10.12 x64 Checked Build and Test (build break fix)

@BruceForstall BruceForstall merged commit 2a6fb30 into dotnet:master Apr 9, 2017
@sdmaclea sdmaclea deleted the PR-ARM64-CpBlkUnroll branch April 10, 2017 15:57
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants