
Use PoolingAsyncValueTaskMethodBuilder for LoadIntoBufferAsync#2

Merged
neuecc merged 1 commit into Cysharp:main from neon-sunset:pool-valuetask
Mar 29, 2024

Conversation

@neon-sunset
Contributor

@neon-sunset neon-sunset commented Mar 28, 2024

This applies PoolingAsyncValueTaskMethodBuilder in a similar manner to what Socket and Stream do.
While it does not improve read performance itself, on low-allocation asynchronous paths I found the async state-machine allocations to be a contributor to GC collection count and pause frequency, so pooling them is profitable.
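For context, the builder is opted into per method via the `[AsyncMethodBuilder]` attribute (C# 10 / .NET 6+). A minimal sketch of the pattern, not the library's actual code (type and parameter names are illustrative):

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public sealed class Utf8StreamReaderSketch
{
    // With the pooling builder, the async state-machine box is taken from
    // a pool and returned on completion instead of being allocated per call.
    // This matters on hot paths that otherwise allocate almost nothing.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public async ValueTask<bool> LoadIntoBufferAsync(Stream source, Memory<byte> buffer)
    {
        int read = await source.ReadAsync(buffer).ConfigureAwait(false);
        return read > 0;
    }
}
```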

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.202
  [Host]     : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2

FromFile before:

| Method                 | Mean     | Error   | StdDev  | Gen0       | Allocated |
|------------------------|---------:|--------:|--------:|-----------:|----------:|
| StreamReader           | 140.5 ms | 1.36 ms | 1.20 ms | 11500.0000 | 186.49 MB |
| Utf8StreamReader       | 174.9 ms | 2.26 ms | 2.12 ms | -          | 2.1 MB    |
| Utf8TextReader         | 192.7 ms | 0.33 ms | 0.30 ms | -          | 2.1 MB    |
| Utf8TextReaderToString | 192.4 ms | 1.06 ms | 0.94 ms | -          | 2.1 MB    |

FromFile after:

| Method                 | Mean     | Error    | StdDev   | Gen0       | Allocated    |
|------------------------|---------:|---------:|---------:|-----------:|-------------:|
| StreamReader           | 138.6 ms | 1.48 ms  | 1.38 ms  | 11666.6667 | 190967.17 KB |
| Utf8StreamReader       | 177.6 ms | 2.83 ms  | 2.64 ms  | -          | 1.16 KB      |
| Utf8TextReader         | 197.8 ms | 0.73 ms  | 0.65 ms  | -          | 1.38 KB      |
| Utf8TextReaderToString | 192.0 ms | 0.63 ms  | 0.56 ms  | -          | 1.2 KB       |
| 😅*                    | 78.87 ms | 1.371 ms | 1.216 ms | 4000.0000  | 69238.38 KB  |

*The main bottleneck for file IO, especially on Windows, is the latency of calls into kernel APIs, and it gets particularly bad with Windows's implementation of Overlapped IO. Unix systems are much better behaved despite a more naive implementation that dispatches a blocking file read as a separate threadpool work item: the latency is far lower, and the approach does not even cause issues because the threadpool copes with blocked workers very well. One way to address this is increasing the starting buffer size to 4KiB or even 8KiB, which U8Reader does. My theory is that modern NVMe drives have fairly large internal memory pages, and filesystems themselves operate on 4, 8, or even 32KiB pages, so a single file read can naturally align and let the OS pull in one or more evenly sliced pages. I don't know whether that is actually why it helps, but increasing the buffer size to 8KiB has shown consistent improvement on all systems in all situations.
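The read loop described above can be sketched with .NET 6's `RandomAccess` API and an 8KiB starting buffer (file name illustrative, error handling omitted):

```csharp
using System.IO;
using Microsoft.Win32.SafeHandles;

// Positional (pread-style) reads through a raw handle, skipping
// FileStream's own buffering and position tracking entirely.
using SafeFileHandle handle = File.OpenHandle("data.txt");
byte[] buffer = new byte[8 * 1024]; // 8KiB starting buffer, as discussed above
long offset = 0;
int read;
while ((read = RandomAccess.Read(handle, buffer, offset)) > 0)
{
    offset += read;
    // ... consume buffer[..read] ...
}
```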

There is also additional cost to async depth even when the ValueTasks are pooled. In fact, it is problematic enough that I had to entirely rewrite ReadToAsyncCore and the various methods forwarding to it in U8Reader in order to make U8SplitReader (effectively a line-reader variant for custom delimiters) profitable to use. It is still somewhat expensive, but at least it gives users a reason to prefer the more memory-friendly `await foreach (var segment in file.AsU8Reader(false).Split(','))` over pre-buffering large text segments and using the normal split iterator.

If the line length in your use case is substantial, or the data arrives in small chunks (like a network stream delimited by lines), this will not help: every call ends up awaiting asynchronously. But if you expect the buffer to hold, say, 5 or more lines, it becomes important to avoid as many awaits as possible even when they complete synchronously (which is also why I'm using RandomAccess).
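The usual shape of that optimization is a synchronous fast path that drains the buffer before touching any async machinery. A hedged sketch, with hypothetical helper names (`TryReadLineFromBuffer`, `FillBufferAsync` are not the library's actual API):

```csharp
using System.Threading.Tasks;

public partial class LineReaderSketch
{
    // Most calls return on the first branch without ever suspending;
    // only a drained buffer pays for the await and the refill.
    public async ValueTask<string?> ReadLineAsync()
    {
        if (TryReadLineFromBuffer(out var line)) // sync fast path
            return line;

        while (await FillBufferAsync().ConfigureAwait(false)) // async slow path
        {
            if (TryReadLineFromBuffer(out line))
                return line;
        }
        return null; // end of stream
    }

    private partial bool TryReadLineFromBuffer(out string? line);
    private partial ValueTask<bool> FillBufferAsync();
}
```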

And also, thank you for your work, your libraries are my go-to recommendation! 😀

@neuecc
Member

neuecc commented Mar 28, 2024

I had been unsure about the pooling attribute myself; in the end I didn't apply it, so it's great to hear your insights on it!

Thank you for the benchmarks as well.
I hadn't measured the performance specifically for File I/O.
I re-measured with different parameters based on your suggestions.

[image: re-measured benchmark results]

Indeed, the current default (NoBuffer_Async) doesn't show very good results.
At the very least, it looks like the buffer size could be larger.

The results of the microbenchmarks make it tricky to decide how to handle useAsync.

@neuecc
Member

neuecc commented Mar 28, 2024

Due to the nature of the implementation, the larger the buffer, the fewer copies and ReadAsync calls, which directly impacts performance.

[image: benchmark results by buffer size]

In serializers like MessagePack and MemoryPack, I've been using 64K buffers obtained from the ArrayPool.
So I'm inclined to think 64K might be a good choice for this as well.
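The standard shape of that pattern (not the serializers' exact code) is renting from the shared pool and returning in a finally block:

```csharp
using System.Buffers;

// ArrayPool<byte>.Shared sizes buffers in power-of-two buckets, so renting
// 64 * 1024 bytes returns an array of at least 65536 bytes that is reused
// across rentals rather than allocated (and collected) fresh each time.
byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
try
{
    // ... fill and consume buffer ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}
```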

@neuecc
Member

neuecc commented Mar 28, 2024

Looking back at the .NET 6 release documentation, I reviewed the following:
https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-6/#io
https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/
https://adamsitnik.com/files/Fast_File_IO_with_DOTNET_6.pdf

Regarding Read, Adam Sitnik's detailed explanation omits the async true/false comparison for a Windows environment!
However, the first article does include that comparison, and it confirms that the results are significantly slower when async is true.
It seems like they didn't want to include a table that would be unfavorable (and would require careful explanation)...

In theory, to summarize:

  • Using async I/O reduces thread pool consumption, increasing scalability
  • Not using async I/O increases thread pool consumption, but it's faster in a Windows environment

I think I want to add an option to change the useAsync argument of Utf8StreamReader.
Also, let's set the buffer size to 64K.
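For reference, what the useAsync argument ultimately controls is whether the underlying file handle is opened for overlapped (asynchronous) I/O. A sketch of the FileStream side of this trade-off (file name illustrative):

```csharp
using System.IO;

// useAsync: false opens a synchronous handle: lower per-call latency on
// Windows, but each read blocks a thread. useAsync: true opens an
// overlapped handle: slower per call, but scales better under load.
using var fs = new FileStream(
    "data.txt", FileMode.Open, FileAccess.Read, FileShare.Read,
    bufferSize: 64 * 1024, // the proposed 64K default
    useAsync: false);
```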

@neuecc neuecc merged commit f66b91f into Cysharp:main Mar 29, 2024
@neon-sunset
Contributor Author

Interesting, I did not expect it to scale so well to such a large buffer size; the effect seems much more pronounced on Windows, too. Reading your notes gave me some food for thought that I can use for yet another rewrite of U8Reader. There's still a lot of performance left on the table (especially around GC-friendly allocation patterns) that could be gained through conditional double-buffering, adapting to read modes on the fly, detecting sync/async modes and then feeding that "strategy" into conditionally re-allocating a handle for sync/async on Windows, and more... though that might be overkill.

Either way, thank you and have a good day :)

@neon-sunset neon-sunset deleted the pool-valuetask branch March 30, 2024 14:17
