Skip to content

Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036

Merged
GrabYourPitchforks merged 2 commits intodotnet:masterfrom
GrabYourPitchforks:intr_opt
Feb 11, 2020
Merged

Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036
GrabYourPitchforks merged 2 commits intodotnet:masterfrom
GrabYourPitchforks:intr_opt

Conversation

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Feb 10, 2020

Cleans up some of the logic slightly by removing SSE4.1-specific paths when SSE2 paths suffice with the same number of instructions. Also optimizes the SSE2-specific codegen a little bit by reusing existing registers whenever possible.

As an example of the type of optimization performed here:

/* old code - 3 instructions (pxor, pcmpgtw, pmovmskb) */
Sse2.MoveMask(Sse2.CompareGreaterThan(Sse2.Xor(combinedVector, asciiMaskForPXOR), asciiMaskForPCMPGTW).AsByte()

/* new code - 2 instructions (paddusw, pmovmskb) */
Sse2.MoveMask(Sse2.AddSaturate(combinedVector, asciiMaskForPADDUSW).AsByte())

Dropping from 3 instructions down to 2 instructions also allows us to get rid of one of the "const" vectors that we use for these operations, hence the more efficient use of SIMD registers.

This code was already fuzzed as part of #31904 (comment).

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization. Any perf numbers?

@GrabYourPitchforks
Copy link
Member Author

Raw throughput difference is in the range of noise, which isn't surprising. But I confirmed that there's more efficient use of registers:

;;; OLD ;;;
00007ffd`5e48bb90 b80080ffff      mov     eax,0FFFF8000h
00007ffd`5e48bb95 660f6ec0        movd    xmm0,eax
00007ffd`5e48bb99 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bb9c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`5e48bba0 0f28c1          movaps  xmm0,xmm1
00007ffd`5e48bba3 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`5e48bba8 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bbab b87f80ffff      mov     eax,0FFFF807Fh
00007ffd`5e48bbb0 660f6ed0        movd    xmm2,eax
00007ffd`5e48bbb4 0f28da          movaps  xmm3,xmm2
00007ffd`5e48bbb7 660f61da        punpcklwd xmm3,xmm2
00007ffd`5e48bbbb 0f28d3          movaps  xmm2,xmm3

;;; NEW ;;;
00007ffd`519ab240 b8807f0000      mov     eax,7F80h
00007ffd`519ab245 660f6ec0        movd    xmm0,eax
00007ffd`519ab249 0f28c8          movaps  xmm1,xmm0
00007ffd`519ab24c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`519ab250 0f28c1          movaps  xmm0,xmm1
00007ffd`519ab253 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`519ab258 0f28c8          movaps  xmm1,xmm0

And the main loop is also a little tighter, as expected:

;;; OLD ;;;
00007ffd`5e48bbc6 f30f6f21        movdqu  xmm4,xmmword ptr [rcx]
00007ffd`5e48bbca 0f28ec          movaps  xmm5,xmm4
00007ffd`5e48bbcd 660fefe8        pxor    xmm5,xmm0
00007ffd`5e48bbd1 0f28c5          movaps  xmm0,xmm5
00007ffd`5e48bbd4 660f65c2        pcmpgtw xmm0,xmm2
00007ffd`5e48bbd8 660fd7c0        pmovmskb eax,xmm0
00007ffd`5e48bbdc 85c0            test    eax,eax

;;; NEW ;;;
00007ffd`519ab28c f30f6f5110      movdqu  xmm2,xmmword ptr [rcx+10h]
00007ffd`519ab291 0f28c2          movaps  xmm0,xmm2
00007ffd`519ab294 660fddc1        paddusw xmm0,xmm1
00007ffd`519ab298 66440fd7c8      pmovmskb r9d,xmm0
00007ffd`519ab29d 4d63c9          movsxd  r9,r9d
00007ffd`519ab2a0 41f7c1aaaa0000  test    r9d,0AAAAh

I'm looking into why there's a stray sign extension in there. I wouldn't have expected that to be there.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Feb 11, 2020

Found and fixed the unnecessary movsxd instruction. :)
Edit: Opened #32089 to track improving this JIT-wide.

@GrabYourPitchforks
Copy link
Member Author

CI is on the floor right now due to failures in System.Threading.Tasks.Tests. Unrelated to this PR. Proceeding with merge.

@GrabYourPitchforks
Copy link
Member Author

Thanks Tanner for the review! :)

@GrabYourPitchforks GrabYourPitchforks merged commit 720f2d6 into dotnet:master Feb 11, 2020
@GrabYourPitchforks GrabYourPitchforks deleted the intr_opt branch February 11, 2020 04:33
monojenkins pushed a commit to monojenkins/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon).

for `Vector128<byte>` overload it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`
akoeplinger pushed a commit to mono/mono that referenced this pull request Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time).

So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon).

for `Vector128<byte>` overload it emits
```llvm
%result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right)
```
which is then emitted as `vpaddusw`

Co-authored-by: Egor Bogatov <egorbo@gmail.com>
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants