[WIP] Performance improvements to Span.Slice by GrabYourPitchforks · Pull Request #23665 · dotnet/coreclr

GrabYourPitchforks · 2019-04-02T21:25:31Z

This updates Span<T>.Slice and ReadOnlySpan<T>.Slice to zero-extend rather than sign-extend the index when computing the new base address. It's a proof of concept demonstrating the performance gains we can get from addressing https://github.com/dotnet/coreclr/issues/23666. I hacked together a prototype that performs mov elimination in all cases, then re-introduced this PR and reran the benchmarks to come up with the following results.

Method	Toolchain	Mean	Error	StdDev	Ratio
Utf8Parser_Sum	3.0-master	20.97 ms	0.4830 ms	1.0903 ms	1.00
Utf8Parser_Sum	spanslice	19.78 ms	0.3494 ms	0.3269 ms	0.88

CountSpaces	3.0-master	12.92 ms	0.2084 ms	0.1848 ms	1.00
CountSpaces	spanslice	12.57 ms	0.1028 ms	0.0911 ms	0.97

private readonly byte[] _bytes = Encoding.UTF8.GetBytes("1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 ");

[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public int Utf8Parser_Sum()
{
    byte[] bytes = _bytes;
    _ = bytes.Length; // allow JIT to determine non-null
    int sum = default;

    for (int i = NUM_ITERS; i > 0; i--)
    {
        ReadOnlySpan<byte> copy = bytes;
        while (!copy.IsEmpty)
        {
            Utf8Parser.TryParse(copy, out int value, out int bytesConsumed, 'N');
            copy = copy.Slice(bytesConsumed);
            copy = copy.Slice(1);
            sum += value;
        }
    }

    return sum;
}

[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public int CountSpaces()
{
    byte[] bytes = _bytes;
    _ = bytes.Length; // allow JIT to determine non-null
    int count = default;

    for (int i = NUM_ITERS; i > 0; i--)
    {
        ReadOnlySpan<byte> copy = bytes;
        while (!copy.IsEmpty)
        {
            int index = copy.IndexOf((byte)' ');
            if (index < 0) { break; }

            count++;
            copy = copy.Slice(index + 1);
        }
    }

    return count;
}

I didn't change the other overloads of Slice because I couldn't create a benchmark that clearly demonstrated a perf win.

GrabYourPitchforks · 2019-04-02T21:29:09Z

/cc @dotnet/jit-contrib as FYI, @stephentoub, @jkotas

jkotas · 2019-04-02T21:41:21Z

How hard is this to fix it in the JIT?

GrabYourPitchforks · 2019-04-02T21:47:16Z

The managed changes are very straightforward - use unsigned instead of signed integers when performing byref arithmetic. A shift-F12 on Unsafe.Add<T>(ref T, int) will quickly find interesting call sites. Somebody from the JIT side should speak to the required changes there since I didn't see any immediately obvious way to address it.

Edit for clarification: We can't just do a search-and-replace on the managed side since there are valid scenarios where a caller might have wanted to use signed byref arithmetic. But my gut tells me that the majority of call sites only ever pass non-negative values and would be eligible candidates to call the unsigned API over the signed API.

jkotas · 2019-04-03T00:53:04Z

I have looked at the asm delta for Utf8Parser_Sum before and after your change. It changes movsxd rax,eax into mov eax,eax. Is it what you see as well? (I does not make any perf difference on my machine.)

// The loop from Utf8Parser_Sum
mov     rcx,qword ptr [rsi+8]
test    rcx,rcx
jne     test!My.Work()+0x3f (00007ff7`ede3226f)
xor     ebp,ebp
xor     r14d,r14d
jmp     test!My.Work()+0x47 (00007ff7`ede32277)
lea     rbp,[rcx+10h]
mov     r14d,dword ptr [rcx+8]
test    r14d,r14d
jbe     test!My.Work()+0x99 (00007ff7`ede322c9)
lea     rcx,[rsp+20h]
mov     qword ptr [rcx],rbp
mov     dword ptr [rcx+8],r14d
lea     rcx,[rsp+20h]
lea     rdx,[rsp+38h]
lea     r8,[rsp+30h]
mov     r9d,4Eh
call    System.Buffers.Text.Utf8Parser.TryParse(System.ReadOnlySpan`1<Byte>, Int32 ByRef, Int32 ByRef, Char) (00007ff7`ede31f08)
mov     eax,dword ptr [rsp+30h]
cmp     eax,r14d
ja      test!My.Work()+0xac (00007ff7`ede322dc)
sub     r14d,eax
movsxd  rax,eax // this turns into mov eax,eax with your change
add     rbp,rax
cmp     r14d,1
jb      test!My.Work()+0xb2 (00007ff7`ede322e2)
dec     r14d
inc     rbp
add     edi,dword ptr [rsp+38h]
test    r14d,r14d
ja      test!My.Work()+0x4c (00007ff7`ede3227c)
dec     ebx
test    ebx,ebx
jg      test!My.Work()+0x2f (00007ff7`ede3225f)

GrabYourPitchforks · 2019-04-03T00:58:55Z

@jkotas - I also hacked the JIT to eliminate the mov entirely. Basically, I "fixed" https://github.com/dotnet/coreclr/issues/23666 and reran the benchmarks to get the numbers you see here. This PR doesn't have that hack (because it was nowhere near production-quality).

AndyAyersMS · 2019-04-03T01:43:15Z

I don't think there is an easy fix in the jit. There are a bunch of inter-related issues we should tackle in a coordinated fashion:

Allow cast containment ([WIP] Enable cast operand containment #12676, responding to #12595, waiting on Sort out ARM load/store instruction size issues #20126)
Don't sign extend for array.len, span.len, and known in-bounds indices when implicit zero extension is available for a 32 bit load (as it is on x64 and arm64). One prototype here: https://github.com/dotnet/coreclr/issues/23127#issuecomment-471040829
Don't explicitly zero extend values that are already implicitly zero extended (in the spirit of JIT: Suppress emitting same-reg zero extending move #22454 but done earlier when we have better producer-consumer connections).
IV widening for 64 bit targets (#9179, [WIP] Eliminate array/Span index sign-extension #21553)

GrabYourPitchforks · 2019-11-05T06:52:33Z

Looks like this isn't needed since per Andy's comments there are tracking issues that will improve the JIT all-around. Closing this.

Performance improvements to Span.Slice

ca3e000

jkotas added the area-CodeGen label Nov 2, 2019

GrabYourPitchforks closed this Nov 5, 2019

GrabYourPitchforks mentioned this pull request Jan 31, 2020

JIT should suppress zero-extending same-register moves in more scenarios dotnet/runtime#12402

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Performance improvements to Span.Slice#23665

[WIP] Performance improvements to Span.Slice#23665
GrabYourPitchforks wants to merge 1 commit intodotnet:masterfrom
GrabYourPitchforks:spanslice

GrabYourPitchforks commented Apr 2, 2019 •

edited

Loading

Uh oh!

GrabYourPitchforks commented Apr 2, 2019

Uh oh!

jkotas commented Apr 2, 2019

Uh oh!

GrabYourPitchforks commented Apr 2, 2019 •

edited

Loading

Uh oh!

jkotas commented Apr 3, 2019 •

edited

Loading

Uh oh!

GrabYourPitchforks commented Apr 3, 2019

Uh oh!

AndyAyersMS commented Apr 3, 2019

Uh oh!

GrabYourPitchforks commented Nov 5, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

GrabYourPitchforks commented Apr 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GrabYourPitchforks commented Apr 2, 2019

Uh oh!

jkotas commented Apr 2, 2019

Uh oh!

GrabYourPitchforks commented Apr 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkotas commented Apr 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GrabYourPitchforks commented Apr 3, 2019

Uh oh!

AndyAyersMS commented Apr 3, 2019

Uh oh!

GrabYourPitchforks commented Nov 5, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

GrabYourPitchforks commented Apr 2, 2019 •

edited

Loading

GrabYourPitchforks commented Apr 2, 2019 •

edited

Loading

jkotas commented Apr 3, 2019 •

edited

Loading