[Arm64] Enable cpblk loop unrolling #10776
Conversation
@dotnet/arm64-contrib PTAL

cc @dotnet/jit-contrib
src/jit/codegenarm64.cpp
Outdated
  // Fill the remainder (15 bytes or less) if there's one.
- if ((size & 0xf) != 0)
+ if ((size & 0x7) != 0)
Unless you implement the larger loads/stores, the comment is wrong now: this fills 7 bytes or less.
src/jit/codegenarm64.cpp
Outdated
- assert(genIsValidFloatReg(xmmReg));
- size_t slots = size / XMM_REGSIZE_BYTES;
+ // TODO-ARM64-CQ: Consider using LDP/STP to save codesize.
+ size_t slots = size / REGSIZE_BYTES;
It seems like Arm64 should do 16 bytes at a time.
If you grab an FP/Vector register it would be the same as X64.
@briansull I was deferring doing 16 bytes at a time.
- I was just trying to get this working first; walk before run.
- ldp/stp with integer registers might be slightly faster than ldr/str of a Q register, but I needed to figure out how to allocate a pair of adjacent registers if I wanted to use ldp/stp (if I remember the instruction encoding form correctly).
- I needed to check the ARMv8 spec for alignment requirements.
- I was concerned about using FP registers and their impact on Linux kernel context switches. This is probably not a real issue.
So I just looked at the ARMv8 spec. Per B2.4.2 "Alignment of data accesses", ldr V0 would require a 16-byte-aligned address, whereas VLD1 V0.16B has no alignment requirement.
Is there any cpBlk alignment guarantee? I assume it is guaranteed to be 8-byte aligned.
First, the LDP/STP instructions do not require adjacent register numbers; any two registers can be used. So you can use two integer registers if you want to avoid using FP.
I don't think that there is an alignment requirement except that the actual SP register must be aligned (if n == 31 then CheckSPAlignment();).
My manual for B2.4.2 reads:
The alignment requirements for accesses to Normal memory are as follows:
• For all instructions that load or store a single or multiple registers, other than Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release, if the address that is accessed is not aligned to the size of the data element being accessed, then one of the following occurs:
— An Alignment fault is generated.
— An unaligned access is performed.
SCTLR_ELx.A at the current Exception level can be configured to enable an alignment check, and thereby determine which of these two options is used.
• For all Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release memory accesses that access a single element or a pair of elements, an Alignment fault is generated if the address being accessed is not aligned to the size of the data structure being accessed.
The key phrase is "one of the following occurs: — An Alignment fault is generated." If SCTLR.A is set, an alignment fault will fire. It would be safer to generate code that does not assume SCTLR.A == 0.
Thanks for the LDP tip. I hadn't gotten a chance to recheck that yet.
Floating point offers code size advantages. For instance, VLD1 V0, V1, V2, V3, [Xn] could load the whole block in a single instruction, but the adjacency requirement makes me avoid trying this. I assume lsra does not already support requesting adjacent registers.
Yeah, I don't think that lsra has any support (yet) for adjacent register requirements.
src/jit/codegenarm64.cpp
Outdated
- // loads and stores.
- if (size >= XMM_REGSIZE_BYTES)
+ // Grab the integer temp register to emit the loads and stores.
+ regNumber tmpReg = genRegNumFromMask(cpBlkNode->gtRsvdRegs & RBM_ALLINT);
You would also have to teach lower to reserve both an fp/vector register and an integer register.
But x64 must already have that logic.
If we switch to a Q register we should be able to just use one FP register.
Well, if you are copying an odd number of bytes you use the FP register for the multiple-of-16 amount and the integer register for the rest.
Yes, but you can do the rest with the same FP register on arm64. You just use the single-lane forms, e.g. ST1 Vt.B[0], [Xn].
Yeah, I hadn't considered using those instructions. The ARM64 encoder does support them; I remember implementing them.
src/jit/codegenarm64.cpp
Outdated
- NYI("genCodeForStoreOffset");
- #endif // !0
+ // For arm64 these functions are identical
+ genCodeForLoadOffset(ins, size, src, base, offset);
This is inconsistent with the design; it should use emitIns_S_R() for the store.
@briansull I just added ldp/stp support to this PR. PTAL
  offset += base->gtLclFld.gtLclOffs;
+ // TODO-ARM64-CQ: Implement support for using a ldp instruction with a varNum (see emitIns_R_S)
+ emit->emitIns_R_S(INS_ldr, EA_8BYTE, dst, base->gtLclVarCommon.gtLclNum, offset);
Yeah, that part isn't implemented yet.
The problem is that there isn't encoding space available to encode both registers in the instruction, because one register field in the encoding is reserved for something else (larger SP offsets).
Looks Good

@dotnet-bot Test OSX10.12 x64 Checked Build and Test (build break fix)
Fixes #10623