Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Performance improvements to Span.Slice#23665

Closed
GrabYourPitchforks wants to merge 1 commit intodotnet:masterfrom
GrabYourPitchforks:spanslice
Closed

[WIP] Performance improvements to Span.Slice#23665
GrabYourPitchforks wants to merge 1 commit intodotnet:masterfrom
GrabYourPitchforks:spanslice

Conversation

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Apr 2, 2019

This updates Span<T>.Slice and ReadOnlySpan<T>.Slice to zero-extend rather than sign-extend the index when computing the new base address. It's a proof of concept demonstrating the performance gains we can get from addressing https://github.com/dotnet/coreclr/issues/23666. I hacked together a prototype that performs mov elimination in all cases, then re-introduced this PR and reran the benchmarks to come up with the following results.

Method Toolchain Mean Error StdDev Ratio
Utf8Parser_Sum 3.0-master 20.97 ms 0.4830 ms 1.0903 ms 1.00
Utf8Parser_Sum spanslice 19.78 ms 0.3494 ms 0.3269 ms 0.88
CountSpaces 3.0-master 12.92 ms 0.2084 ms 0.1848 ms 1.00
CountSpaces spanslice 12.57 ms 0.1028 ms 0.0911 ms 0.97
private readonly byte[] _bytes = Encoding.UTF8.GetBytes("1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 1 22 333 4444 55555 666666 7777777 ");

[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public int Utf8Parser_Sum()
{
    byte[] bytes = _bytes;
    _ = bytes.Length; // allow JIT to determine non-null
    int sum = default;

    for (int i = NUM_ITERS; i > 0; i--)
    {
        ReadOnlySpan<byte> copy = bytes;
        while (!copy.IsEmpty)
        {
            Utf8Parser.TryParse(copy, out int value, out int bytesConsumed, 'N');
            copy = copy.Slice(bytesConsumed);
            copy = copy.Slice(1);
            sum += value;
        }
    }

    return sum;
}

[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public int CountSpaces()
{
    byte[] bytes = _bytes;
    _ = bytes.Length; // allow JIT to determine non-null
    int count = default;

    for (int i = NUM_ITERS; i > 0; i--)
    {
        ReadOnlySpan<byte> copy = bytes;
        while (!copy.IsEmpty)
        {
            int index = copy.IndexOf((byte)' ');
            if (index < 0) { break; }

            count++;
            copy = copy.Slice(index + 1);
        }
    }

    return count;
}

I didn't change the other overloads of Slice because I couldn't create a benchmark that clearly demonstrated a perf win.

@GrabYourPitchforks
Copy link
Member Author

/cc @dotnet/jit-contrib as FYI, @stephentoub, @jkotas

@jkotas
Copy link
Member

jkotas commented Apr 2, 2019

How hard is this to fix it in the JIT?

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Apr 2, 2019

The managed changes are very straightforward - use unsigned instead of signed integers when performing byref arithmetic. A shift-F12 on Unsafe.Add<T>(ref T, int) will quickly find interesting call sites. Somebody from the JIT side should speak to the required changes there since I didn't see any immediately obvious way to address it.

Edit for clarification: We can't just do a search-and-replace on the managed side since there are valid scenarios where a caller might have wanted to use signed byref arithmetic. But my gut tells me that the majority of call sites only ever pass non-negative values and would be eligible candidates to call the unsigned API over the signed API.

@jkotas
Copy link
Member

jkotas commented Apr 3, 2019

I have looked at the asm delta for Utf8Parser_Sum before and after your change. It changes movsxd rax,eax into mov eax,eax. Is it what you see as well? (I does not make any perf difference on my machine.)

// The loop from Utf8Parser_Sum
mov     rcx,qword ptr [rsi+8]
test    rcx,rcx
jne     test!My.Work()+0x3f (00007ff7`ede3226f)
xor     ebp,ebp
xor     r14d,r14d
jmp     test!My.Work()+0x47 (00007ff7`ede32277)
lea     rbp,[rcx+10h]
mov     r14d,dword ptr [rcx+8]
test    r14d,r14d
jbe     test!My.Work()+0x99 (00007ff7`ede322c9)
lea     rcx,[rsp+20h]
mov     qword ptr [rcx],rbp
mov     dword ptr [rcx+8],r14d
lea     rcx,[rsp+20h]
lea     rdx,[rsp+38h]
lea     r8,[rsp+30h]
mov     r9d,4Eh
call    System.Buffers.Text.Utf8Parser.TryParse(System.ReadOnlySpan`1<Byte>, Int32 ByRef, Int32 ByRef, Char) (00007ff7`ede31f08)
mov     eax,dword ptr [rsp+30h]
cmp     eax,r14d
ja      test!My.Work()+0xac (00007ff7`ede322dc)
sub     r14d,eax
movsxd  rax,eax // this turns into mov eax,eax with your change
add     rbp,rax
cmp     r14d,1
jb      test!My.Work()+0xb2 (00007ff7`ede322e2)
dec     r14d
inc     rbp
add     edi,dword ptr [rsp+38h]
test    r14d,r14d
ja      test!My.Work()+0x4c (00007ff7`ede3227c)
dec     ebx
test    ebx,ebx
jg      test!My.Work()+0x2f (00007ff7`ede3225f)

@GrabYourPitchforks
Copy link
Member Author

@jkotas - I also hacked the JIT to eliminate the mov entirely. Basically, I "fixed" https://github.com/dotnet/coreclr/issues/23666 and reran the benchmarks to get the numbers you see here. This PR doesn't have that hack (because it was nowhere near production-quality).

@AndyAyersMS
Copy link
Member

I don't think there is an easy fix in the jit. There are a bunch of inter-related issues we should tackle in a coordinated fashion:

@GrabYourPitchforks
Copy link
Member Author

Looks like this isn't needed since per Andy's comments there are tracking issues that will improve the JIT all-around. Closing this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants