Memmove: Use Vector types rather than Custom blocks#25172
benaadams wants to merge 4 commits into dotnet:master
Conversation
What is the measured difference between 128-bit only and also supporting 256-bit here?
```cs
Debug.Assert(len > 16 && len <= 64);
#if HAS_CUSTOM_BLOCKS
*(Block16*)dest = *(Block16*)src; // [0,16]
Unsafe.AsRef<Vector128<byte>>(dest) = Unsafe.AsRef<Vector128<byte>>(src); // [0,16]
```
We already have a pointer, why not explicitly Load. That would more easily allow use of the non-temporal loads, if desirable (and make it easier to test them).
Don't know if non-temporal is desirable.
Also would mean every copy would need to be doubled and guarded with an .IsSupported rather than just working?
> Also would mean every copy would need to be doubled and guarded with an `.IsSupported` rather than just working?
I think providing a Load helper method could be used to handle that, and abstract away the x86 vs Arm vs fallback cases.
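A minimal sketch of what such a helper could look like (hypothetical names and shapes, not code from this PR; the actual change kept the block-assignment style instead):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class BlockCopy
{
    // Hypothetical Load/Store helpers that hide the Unsafe plumbing from the
    // copy call sites; the JIT inlines these down to single unaligned vector
    // moves. Non-temporal variants could later be swapped in here behind
    // IsSupported checks without touching every call site.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> Load16(ref byte src, int offset)
        => Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.Add(ref src, offset));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Store16(ref byte dest, int offset, Vector128<byte> value)
        => Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, offset), value);

    // Example call site: copy the 16 bytes at [offset, offset+16).
    public static void CopyBlock16(ref byte dest, ref byte src, int offset)
        => Store16(ref dest, offset, Load16(ref src, offset));
}
```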
```cs
goto MCPY01;
#if HAS_CUSTOM_BLOCKS
Unsafe.As<byte, Block16>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Block16>(ref Unsafe.Add(ref src, 16)); // [0,32]
Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref dest, 16)) = Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16)); // [0,32]
```
Given that we pin for long arrays, and have handling for short arrays, is there any reason we can't just pin for the medium array length case as well?
`Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref src, 16))` is hard to read and easy to get wrong.
There are two different methods: one that takes pointers, and this one that takes `ref` and doesn't pin anything.
It pins if it hits the native fallback case (long arrays), where it calls _Memmove.
For >= Haswell the xmm and ymm instructions have the same clock-cycle cost (though one of the ymm ones is +1/3 latency). As it's operating on double the data size, it should be a net win? I also wanted a good excuse to double the P/Invoke switch-over length 😉
Any numbers from micro-benchmarks demonstrating the improvement?
Some processors, especially newer ones, can dispatch two 128-bit instructions or one 256-bit instruction per cycle. There are some potential drawbacks to using the 256-bit instructions, especially when the alignment is not guaranteed (and potential minor downclocking that can occur). So, I was wondering if we had any numbers showing that the 256-bit path was worth the additional maintenance cost (or if just having a 128-bit path is sufficient for the 95% of cases).
Can minimise the changes instead, so it deletes more than it adds as per the last commit [+8 −12]; so less maintenance cost than currently? Will follow up with perf numbers.
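A rough harness for gathering such numbers could look like the sketch below (illustrative only; real numbers would come from BenchmarkDotNet, and the size/iteration choices here are arbitrary):

```csharp
using System;
using System.Diagnostics;

static class CopyBench
{
    // Very rough throughput measurement of the managed copy path for a given
    // size. To contrast the 128-bit and 256-bit paths, run once with
    // COMPlus_EnableAVX2=0 and once with the default environment.
    public static double MeasureMBPerSec(int size, int iterations)
    {
        byte[] src = new byte[size];
        byte[] dst = new byte[size];

        // Warm up so the copy is Tier-1 compiled before timing.
        for (int i = 0; i < 1000; i++)
            Buffer.BlockCopy(src, 0, dst, 0, size);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Buffer.BlockCopy(src, 0, dst, 0, size);
        sw.Stop();

        return (double)size * iterations / (1024 * 1024) / sw.Elapsed.TotalSeconds;
    }
}
```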
```cs
#if AMD64 || ARM64 || (BIT32 && !ARM)
#define HAS_CUSTOM_BLOCKS
using Block16 = System.Runtime.Intrinsics.Vector128<byte>;
using Block32 = System.Runtime.Intrinsics.Vector256<byte>;
```
How is this going to behave on ARM64, where Vector256 is not a TYP_SIMD (is it two 16-byte copies or is it copying 4 ulongs, since that is what Vector256<T> has as its fields)?
What about on x86 when AVX is not supported?
TYP_SIMD doesn't matter; as there is no field access it will behave the same as all 32-byte structs (like the Block64 previously) and use the largest R2R-able size, e.g. the R2R version for x64 is

```asm
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ...
G_M6895_IG11:
       movdqu   xmm0, qword ptr [rdx]
       movdqu   qword ptr [rcx], xmm0
       movdqu   xmm0, qword ptr [rdx+16]
       movdqu   qword ptr [rcx+16], xmm0
```

However, because it is a TYP_SIMD, the Tier-1 JIT x64 becomes

```asm
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ...
G_M6895_IG11:
       vmovupd  ymm0, ymmword ptr [rdx]
       vmovupd  ymmword ptr [rcx], ymm0
```

Whereas it stays as the 16-byte chunks with a regular struct.
I'm not sure why ARM32 is not included, maybe it faults if not aligned?
> TYP_SIMD doesn't matter
I might be mistaken, but I believe that optimization depends on TYP_SIMD and therefore FEATURE_SIMD.
FEATURE_SIMD is not enabled for ARM32, which would explain why it is missing out.
Checking how to get arm disasm
Not sure how to do ARM; x86 is fine though

```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; ...
G_M6903_IG11:
       movdqu   xmm0, qword ptr [edx]
       movdqu   qword ptr [ecx], xmm0
       movdqu   xmm0, qword ptr [edx+16]
       movdqu   qword ptr [ecx+16], xmm0
```

and

```asm
; Emitting BLENDED_CODE for generic X86 CPU - Windows
; Tier-1 compilation
G_M6903_IG11:
       vmovupd  ymm0, ymmword ptr [edx]
       vmovupd  ymmword ptr [ecx], ymm0
```
If you have built x86 Windows, you should have an arm32 altjit. This will show you what code the jit would produce on arm32.
For jit-diffs you can use this via `--altjit armelnonjit.dll`.
With corerun you can set `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJit=*`.
For crossgen, `COMPlus_AltJitName=armelnonjit.dll` and `COMPlus_AltJitNgen=*`.
Thank you @AndyAyersMS
ARM32 using the blocks and without seems to generate the same asm
```asm
G_M6903_IG11:
       ldr     r1, [r5]
       str     r1, [r4]
       ldr     r1, [r5+4]
       str     r1, [r4+4]
       ldr     r1, [r5+8]
       str     r1, [r4+8]
       ldr     r1, [r5+12]
       str     r1, [r4+12]
       ; ... etc
```

However, I don't know enough about ARM to take out the non-block path, so will leave it in.

Without NEON a multi-register instruction seems fastest at +11% (though it also uses more registers)

```asm
       LDMIA   r1!, {r3 - r10}
       STMIA   r0!, {r3 - r10}
```

followed by the asm that is currently generated

```asm
       LDR     r3, [r1], #4
       STR     r3, [r0], #4
```

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Actually, if it's the same, might as well remove the non-block path, as it's a lot of code?
Looks like 2x 256-bit loads and 1x 256-bit store; so need to tweak a little.

- Skylake: L1 Cache Peak Bandwidth (bytes/cyc) => 96 (2x32B Load + 1x32B Store) (a)
- Haswell: The L1 data cache can handle two 256-bit load and one 256-bit store operations each cycle. (b)

a. From Table 2-4. Cache Parameters of the Skylake Microarchitecture
Hmm... can get 25+% for 512-8192+ bytes; however the algo is outlandishly sensitive to changes, so trying to improve the lower end regresses the larger sizes. Not sure it's worth spending more time on it at this time.


With "Allow pregenerating most HW intrinsics in CoreLib" #24917, we can use the `Vector128` and `Vector256` types and `IsSupported` check and still have the methods R2R AoT-compiled for fast start, using 128-bit copies. When the methods are re-compiled at Tier-1 they then upgrade to 256-bit copies.
This allows the best of both worlds.
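The pattern can be sketched roughly as follows (an illustrative standalone example, not the actual `Buffer.Memmove` source): a `Vector256<byte>`-sized block copy that R2R lowers to paired 16-byte SSE2 moves, and that the Tier-1 JIT upgrades to single 32-byte ymm moves on AVX hardware.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class VectorCopySketch
{
    // Copies 64 bytes as two 32-byte Vector256 block moves. No field or
    // element access happens, so even without AVX the JIT can treat the
    // struct as an opaque 32-byte block and emit paired 16-byte moves;
    // with AVX at Tier-1 each move becomes one ymm load/store.
    public static void Copy64(ref byte dest, ref byte src)
    {
        Unsafe.WriteUnaligned(ref dest,
            Unsafe.ReadUnaligned<Vector256<byte>>(ref src));
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref dest, 32),
            Unsafe.ReadUnaligned<Vector256<byte>>(ref Unsafe.Add(ref src, 32)));
    }
}
```

The unaligned read/write helpers are used here because the copy makes no alignment guarantees; the PR itself expresses the same block moves via `using Block16 = Vector128<byte>` / `Block32 = Vector256<byte>` type aliases.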
/cc @jkotas @GrabYourPitchforks @tannergooding @MichalStrehovsky
Notes:

Doubled the length for the managed copy after the upgrade, since the register size doubles.

The R2R version is almost identical; however some `lea`s creep in (I think @mikedn has a fix for this).