Memmove: Use Vector types rather than Custom blocks#25172
benaadams wants to merge 4 commits into dotnet:master
Conversation
What is the measured difference between 128-bit only and also supporting 256-bit here?
```cs
Debug.Assert(len > 16 && len <= 64);
#if HAS_CUSTOM_BLOCKS
*(Block16*)dest = *(Block16*)src; // [0,16]
Unsafe.AsRef<Vector128<byte>>(dest) = Unsafe.AsRef<Vector128<byte>>(src); // [0,16]
```
We already have a pointer, why not explicitly Load. That would more easily allow use of the non-temporal loads, if desirable (and make it easier to test them).
Don't know if non-temporal is desirable.
Also would mean every copy would need to be doubled and guarded with an .IsSupported rather than just working?
> Also would mean every copy would need to be doubled and guarded with an `.IsSupported` rather than just working?
I think providing a Load helper method could be used to handle that, and abstract away the x86 vs Arm vs fallback cases.
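A minimal sketch of what such a helper could look like (hypothetical names and shapes, not code from this PR; the actual change kept the block-assignment style instead):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class BlockCopy
{
    // Hypothetical Load/Store helpers that hide the Unsafe plumbing from the
    // copy call sites; the JIT inlines these down to single unaligned vector
    // moves. Non-temporal variants could later be swapped in here behind
    // IsSupported checks without touching every call site.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> Load16(ref byte src, int offset)
        => Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref src, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Store16(ref byte dest, int offset, Vector128<byte> value)
        => Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, offset), value);

    // Example call site: copy the 16 bytes at [offset, offset+16).
    public static void CopyBlock16(ref byte dest, ref byte src, int offset)
        => Store16(ref dest, offset, Load16(ref src, offset));
}
```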
```cs
goto MCPY01;
#if HAS_CUSTOM_BLOCKS
Unsafe.As<byte, Block16>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Block16>(ref Unsafe.Add(ref src, 16)); // [0,32]
Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16)); // [0,32]
```
Given that we pin for long arrays, and have handling for short arrays, is there any reason we can't just pin for the medium array length case as well?
`Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16))` is hard to read and easy to get wrong.
There are two different methods: one that takes pointers, and this one that takes `ref` and doesn't pin anything.
It pins if it hits the native fallback case (long arrays), where it calls _Memmove.
For >= Haswell the xmm and ymm instructions have the same clock-cycle cost (though one of the ymm ones is +1/3 latency). As it's operating on double the data size, it should be a net win? I also wanted a good excuse to double the P/Invoke switch-over length 😉
Any numbers from micro-benchmarks demonstrating the improvement?
Some processors, especially newer ones, can dispatch two 128-bit instructions or one 256-bit instruction per cycle. There are some potential drawbacks to using the 256-bit instructions, especially when the alignment is not guaranteed (and potential minor downclocking that can occur). So, I was wondering if we had any numbers showing that the 256-bit path was worth the additional maintenance cost (or if just having a 128-bit path is sufficient for the 95% of cases).
Can minimise the changes instead, so it deletes more than it adds as per the last commit [+8 −12]; so less maintenance cost than currently? Will follow up with perf numbers.
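A rough harness for gathering such numbers could look like the sketch below (illustrative only; real numbers would come from BenchmarkDotNet, and the size/iteration choices here are arbitrary):

```csharp
using System;
using System.Diagnostics;

static class CopyBench
{
    // Very rough throughput measurement of the managed copy path for a given
    // size. To contrast the 128-bit and 256-bit paths, run once with
    // COMPlus_EnableAVX2=0 and once with the default environment.
    public static double MeasureMBPerSec(int size, int iterations)
    {
        byte[] src = new byte[size];
        byte[] dst = new byte[size];

        // Warm up so the copy is Tier-1 compiled before timing.
        for (int i = 0; i < 1000; i++)
            Buffer.BlockCopy(src, 0, dst, 0, size);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Buffer.BlockCopy(src, 0, dst, 0, size);
        sw.Stop();

        return (double)size * iterations / (1024 * 1024) / sw.Elapsed.TotalSeconds;
    }
}
```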
```cs
#if AMD64 || ARM64 || (BIT32 && !ARM)
#define HAS_CUSTOM_BLOCKS
using Block16 = System.Runtime.Intrinsics.Vector128<byte>;
using Block32 = System.Runtime.Intrinsics.Vector256<byte>;
```
How is this going to behave on ARM64, where Vector256 is not a TYP_SIMD (is it two 16-byte copies or is it copying 4 ulongs, since that is what Vector256<T> has as its fields)?
What about on x86 when AVX is not supported?
TYP_SIMD doesn't matter; as there is no field access it will behave the same as all 32-byte structs (like the Block64 previously) and use the largest R2R-able size, e.g. the R2R version for x64 is

```asm
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ...
G_M6895_IG11:
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [rcx], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [rcx+16], xmm0
```

However, because it is a TYP_SIMD, the Tier-1 JIT x64 becomes

```asm
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ...
G_M6895_IG11:
       vmovupd  ymm0, ymmword ptr [rdx]
       vmovupd  ymmword ptr [rcx], ymm0
```

Whereas it stays as the 16-byte chunks with a regular struct.
I'm not sure why ARM32 is not included, maybe it faults if not aligned?
> TYP_SIMD doesn't matter
I might be mistaken, but I believe that optimization depends on TYP_SIMD and therefore FEATURE_SIMD.
FEATURE_SIMD is not enabled for ARM32, which would explain why it is missing out.
Checking how to get arm disasm
Not sure how to do ARM; x86 is fine though

```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; ...
G_M6903_IG11:
       movdqu   xmm0, qword ptr [edx]
       movdqu   qword ptr [ecx], xmm0
       movdqu   xmm0, qword ptr [edx+16]
       movdqu   qword ptr [ecx+16], xmm0
```

and

```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; Tier-1 compilation
G_M6903_IG11:
       vmovupd  ymm0, ymmword ptr [edx]
       vmovupd  ymmword ptr [ecx], ymm0
```
If you have built x86 Windows, you should have an arm32 altjit. This will show you what code the jit would produce on arm32.
For jit-diffs you can use this via `--altjit armelnonjit.dll`.
With corerun you can set `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJit=*`.
For crossgen, `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJitNgen=*`.
Thank you @AndyAyersMS
ARM32 using the blocks and without seems to generate the same asm
```asm
G_M6903_IG11:
       ldr     r1, [r5]
       str     r1, [r4]
       ldr     r1, [r5+4]
       str     r1, [r4+4]
       ldr     r1, [r5+8]
       str     r1, [r4+8]
       ldr     r1, [r5+12]
       str     r1, [r4+12]
       ; ... etc
```

However, I don't know enough about ARM to take out the non-block path, so will leave it in.

Without NEON a multi-register instruction seems fastest at +11% (though it also uses more registers)

```asm
       LDMIA   r1!, {r3 - r10}
       STMIA   r0!, {r3 - r10}
```

followed by the asm that is currently generated

```asm
       LDR     r3, [r1], #4
       STR     r3, [r0], #4
```

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Actually, if it's the same, might as well remove the non-block path, as it's a lot of code?
Looks like 2x 256-bit loads and 1x 256-bit store; so need to tweak a little.

- Skylake: L1 Cache Peak Bandwidth (bytes/cyc) => 96 (2x32B Load + 1x32B Store) (a)
- Haswell: The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. (b)

a. From Table 2-4. Cache Parameters of the Skylake Microarchitecture
Hmm... can get 25+% for 512-8192+ bytes; however the algo is outlandishly sensitive to changes, so trying to improve the lower end regresses the larger sizes. Not sure it's worth spending more time on it at this time.


With "Allow pregenerating most HW intrinsics in CoreLib" #24917, we can use the `Vector128` and `Vector256` types and `IsSupported` check and still have the methods R2R AoT-compiled for fast start, using 128-bit copies. When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.
This allows the best of both worlds.
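The pattern can be sketched roughly as follows (an illustrative standalone example, not the actual `Buffer.Memmove` source): a `Vector256<byte>`-sized block copy that R2R lowers to paired 16-byte SSE2 moves, and that the Tier-1 JIT upgrades to single 32-byte ymm moves on AVX hardware.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorCopySketch
{
    // Copies 64 bytes as two 32-byte Vector256 block moves. No field or
    // element access happens, so even without AVX the JIT can treat the
    // struct as an opaque 32-byte block and emit paired 16-byte moves;
    // with AVX at Tier-1 each move becomes one ymm load/store.
    public static void Copy64(ref byte dest, ref byte src)
    {
        Unsafe.WriteUnaligned(ref dest,
            Unsafe.ReadUnaligned<Vector256<byte>>(ref src));
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, 32),
            Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.Add(ref src, 32)));
    }
}
```

The unaligned read/write helpers are used here because the copy makes no alignment guarantees; the PR itself expresses the same block moves via `using Block16 = Vector128<byte>` / `Block32 = Vector256<byte>` type aliases.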
/cc @jkotas @GrabYourPitchforks @tannergooding @MichalStrehovsky
Notes:

Doubled the length for the managed copy after the upgrade, since the register size doubles.

The R2R version is almost identical; however some `lea`s creep in (I think @mikedn has a fix for this).