Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036
Merged
GrabYourPitchforks merged 2 commits intodotnet:masterfrom Feb 11, 2020
Merged
Tighten up some SSE2 ASCII + UTF-16 transcoding logic#32036GrabYourPitchforks merged 2 commits intodotnet:masterfrom
GrabYourPitchforks merged 2 commits intodotnet:masterfrom
Conversation
tannergooding
approved these changes
Feb 10, 2020
Member
tannergooding
left a comment
There was a problem hiding this comment.
Nice optimization. Any perf numbers?
Member
Author
|
Raw throughput difference is in the range of noise, which isn't surprising. But I confirmed that there's more efficient use of registers: ;;; OLD ;;;
00007ffd`5e48bb90 b80080ffff mov eax,0FFFF8000h
00007ffd`5e48bb95 660f6ec0 movd xmm0,eax
00007ffd`5e48bb99 0f28c8 movaps xmm1,xmm0
00007ffd`5e48bb9c 660f61c8 punpcklwd xmm1,xmm0
00007ffd`5e48bba0 0f28c1 movaps xmm0,xmm1
00007ffd`5e48bba3 660f70c000 pshufd xmm0,xmm0,0
00007ffd`5e48bba8 0f28c8 movaps xmm1,xmm0
00007ffd`5e48bbab b87f80ffff mov eax,0FFFF807Fh
00007ffd`5e48bbb0 660f6ed0 movd xmm2,eax
00007ffd`5e48bbb4 0f28da movaps xmm3,xmm2
00007ffd`5e48bbb7 660f61da punpcklwd xmm3,xmm2
00007ffd`5e48bbbb 0f28d3 movaps xmm2,xmm3
;;; NEW ;;;
00007ffd`519ab240 b8807f0000 mov eax,7F80h
00007ffd`519ab245 660f6ec0 movd xmm0,eax
00007ffd`519ab249 0f28c8 movaps xmm1,xmm0
00007ffd`519ab24c 660f61c8 punpcklwd xmm1,xmm0
00007ffd`519ab250 0f28c1 movaps xmm0,xmm1
00007ffd`519ab253 660f70c000 pshufd xmm0,xmm0,0
00007ffd`519ab258 0f28c8 movaps xmm1,xmm0And the main loop is also a little tighter, as expected: ;;; OLD ;;;
00007ffd`5e48bbc6 f30f6f21 movdqu xmm4,xmmword ptr [rcx]
00007ffd`5e48bbca 0f28ec movaps xmm5,xmm4
00007ffd`5e48bbcd 660fefe8 pxor xmm5,xmm0
00007ffd`5e48bbd1 0f28c5 movaps xmm0,xmm5
00007ffd`5e48bbd4 660f65c2 pcmpgtw xmm0,xmm2
00007ffd`5e48bbd8 660fd7c0 pmovmskb eax,xmm0
00007ffd`5e48bbdc 85c0 test eax,eax
;;; NEW ;;;
00007ffd`519ab28c f30f6f5110 movdqu xmm2,xmmword ptr [rcx+10h]
00007ffd`519ab291 0f28c2 movaps xmm0,xmm2
00007ffd`519ab294 660fddc1 paddusw xmm0,xmm1
00007ffd`519ab298 66440fd7c8 pmovmskb r9d,xmm0
00007ffd`519ab29d 4d63c9 movsxd r9,r9d
00007ffd`519ab2a0 41f7c1aaaa0000 test r9d,0AAAAhI'm looking into why there's a stray sign extension in there. I wouldn't have expected that to be there. |
Member
Author
|
Found and fixed the unnecessary |
Member
Author
|
CI is on the floor right now due to failures in System.Threading.Tasks.Tests. Unrelated to this PR. Proceeding with merge. |
Member
Author
|
Thanks Tanner for the review! :) |
monojenkins
pushed a commit
to monojenkins/mono
that referenced
this pull request
Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time). So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon). for `Vector128<byte>` overload it emits ```llvm %result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right) ``` which is then emitted as `vpaddusw`
akoeplinger
pushed a commit
to mono/mono
that referenced
this pull request
Feb 11, 2020
Mono currently supports a limited subset of **Sse1**-**Sse42** intrinsics used only by CoreLib internally. So once a new API is used to optimize things there we also have to implement it in Mono. So it recently happened in dotnet/runtime#32036 (`Sse2.AddSaturate` was used for the first time). So this PR implements it for mono (with all Sse2 overloads) using named LLVM intrinsics. It turns out it's different between LLVM6 (we currently based on) and LLVM9 (we plan to migrate to soon). for `Vector128<byte>` overload it emits ```llvm %result = call <8 x i16> @llvm.x86.sse2.paddus.b(<16 x i8> %left, <16 x i8> %right) ``` which is then emitted as `vpaddusw` Co-authored-by: Egor Bogatov <egorbo@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cleans up some of the logic slightly by removing SSE4.1-specific paths when SSE2 paths suffice with the same number of instructions. Also optimizes the SSE2-specific codegen a little bit by reusing existing registers whenever possible.
As an example of the type of optimization performed here:
Dropping from 3 instructions down to 2 instructions also allows us to get rid of one of the "const" vectors that we use for these operations, hence the more efficient use of SIMD registers.
This code was already fuzzed as part of #31904 (comment).