Tighten up some SSE2 ASCII + UTF-16 transcoding logic by GrabYourPitchforks · Pull Request #32036 · dotnet/runtime

GrabYourPitchforks · 2020-02-10T19:20:36Z

Cleans up some of the logic slightly by removing SSE4.1-specific paths when SSE2 paths suffice with the same number of instructions. Also optimizes the SSE2-specific codegen a little bit by reusing existing registers whenever possible.

As an example of the type of optimization performed here:

/* old code - 3 instructions (pxor, pcmpgtw, pmovmskb) */
Sse2.MoveMask(Sse2.CompareGreaterThan(Sse2.Xor(combinedVector, asciiMaskForPXOR), asciiMaskForPCMPGTW).AsByte()

/* new code - 2 instructions (paddusw, pmovmskb) */
Sse2.MoveMask(Sse2.AddSaturate(combinedVector, asciiMaskForPADDUSW).AsByte())

Dropping from 3 instructions down to 2 instructions also allows us to get rid of one of the "const" vectors that we use for these operations, hence the more efficient use of SIMD registers.

This code was already fuzzed as part of #31904 (comment).

tannergooding

Nice optimization. Any perf numbers?

GrabYourPitchforks · 2020-02-11T02:11:07Z

Raw throughput difference is in the range of noise, which isn't surprising. But I confirmed that there's more efficient use of registers:

;;; OLD ;;;
00007ffd`5e48bb90 b80080ffff      mov     eax,0FFFF8000h
00007ffd`5e48bb95 660f6ec0        movd    xmm0,eax
00007ffd`5e48bb99 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bb9c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`5e48bba0 0f28c1          movaps  xmm0,xmm1
00007ffd`5e48bba3 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`5e48bba8 0f28c8          movaps  xmm1,xmm0
00007ffd`5e48bbab b87f80ffff      mov     eax,0FFFF807Fh
00007ffd`5e48bbb0 660f6ed0        movd    xmm2,eax
00007ffd`5e48bbb4 0f28da          movaps  xmm3,xmm2
00007ffd`5e48bbb7 660f61da        punpcklwd xmm3,xmm2
00007ffd`5e48bbbb 0f28d3          movaps  xmm2,xmm3

;;; NEW ;;;
00007ffd`519ab240 b8807f0000      mov     eax,7F80h
00007ffd`519ab245 660f6ec0        movd    xmm0,eax
00007ffd`519ab249 0f28c8          movaps  xmm1,xmm0
00007ffd`519ab24c 660f61c8        punpcklwd xmm1,xmm0
00007ffd`519ab250 0f28c1          movaps  xmm0,xmm1
00007ffd`519ab253 660f70c000      pshufd  xmm0,xmm0,0
00007ffd`519ab258 0f28c8          movaps  xmm1,xmm0

And the main loop is also a little tighter, as expected:

;;; OLD ;;;
00007ffd`5e48bbc6 f30f6f21        movdqu  xmm4,xmmword ptr [rcx]
00007ffd`5e48bbca 0f28ec          movaps  xmm5,xmm4
00007ffd`5e48bbcd 660fefe8        pxor    xmm5,xmm0
00007ffd`5e48bbd1 0f28c5          movaps  xmm0,xmm5
00007ffd`5e48bbd4 660f65c2        pcmpgtw xmm0,xmm2
00007ffd`5e48bbd8 660fd7c0        pmovmskb eax,xmm0
00007ffd`5e48bbdc 85c0            test    eax,eax

;;; NEW ;;;
00007ffd`519ab28c f30f6f5110      movdqu  xmm2,xmmword ptr [rcx+10h]
00007ffd`519ab291 0f28c2          movaps  xmm0,xmm2
00007ffd`519ab294 660fddc1        paddusw xmm0,xmm1
00007ffd`519ab298 66440fd7c8      pmovmskb r9d,xmm0
00007ffd`519ab29d 4d63c9          movsxd  r9,r9d
00007ffd`519ab2a0 41f7c1aaaa0000  test    r9d,0AAAAh

I'm looking into why there's a stray sign extension in there. I wouldn't have expected that to be there.

GrabYourPitchforks · 2020-02-11T02:26:36Z

Found and fixed the unnecessary movsxd instruction. :)
Edit: Opened #32089 to track improving this JIT-wide.

GrabYourPitchforks · 2020-02-11T04:32:02Z

CI is on the floor right now due to failures in System.Threading.Tasks.Tests. Unrelated to this PR. Proceeding with merge.

GrabYourPitchforks · 2020-02-11T04:32:11Z

Thanks Tanner for the review! :)

Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time). So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon). for `Vector128<byte>` overload it emits ```llvm %result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right) ``` which is then emitted as `vpaddusw`

Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time). So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon). for `Vector128<byte>` overload it emits ```llvm %result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right) ``` which is then emitted as `vpaddusw` Co-authored-by: Egor Bogatov <egorbo@gmail.com>

Tighten up some SSE2 ASCII + UTF-16 transcoding logic

0f4acf9

GrabYourPitchforks added the area-System.Text.Encoding label Feb 10, 2020

GrabYourPitchforks requested a review from tannergooding February 10, 2020 19:20

GrabYourPitchforks mentioned this pull request Feb 10, 2020

Remove BMI2 intrinsics from ASCII and UTF-8 transcoding logic #31904

Merged

tannergooding approved these changes Feb 10, 2020

View reviewed changes

Avoid unnecessary sign extension after pmovmskb

7da52ac

GrabYourPitchforks merged commit 720f2d6 into dotnet:master Feb 11, 2020

GrabYourPitchforks deleted the intr_opt branch February 11, 2020 04:33

EgorBo mentioned this pull request Feb 11, 2020

[mono] Implement Sse2.AddSaturate using LLVM #32116

Merged

monojenkins mentioned this pull request Feb 11, 2020

[mono] Implement Sse2.AddSaturate using LLVM mono/mono#18793

Merged

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036

Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036
GrabYourPitchforks merged 2 commits intodotnet:masterfrom
GrabYourPitchforks:intr_opt

GrabYourPitchforks commented Feb 10, 2020 •

edited

Loading

Uh oh!

tannergooding left a comment

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

GrabYourPitchforks commented Feb 11, 2020 •

edited

Loading

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

GrabYourPitchforks commented Feb 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tannergooding left a comment

Choose a reason for hiding this comment

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

GrabYourPitchforks commented Feb 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

GrabYourPitchforks commented Feb 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

GrabYourPitchforks commented Feb 10, 2020 •

edited

Loading

GrabYourPitchforks commented Feb 11, 2020 •

edited

Loading