
Use PoolingAsyncValueTaskMethodBuilder for LoadIntoBufferAsync#2

Merged
neuecc merged 1 commit into Cysharp:main from neon-sunset:pool-valuetask
Mar 29, 2024

Conversation

@neon-sunset
Contributor

@neon-sunset neon-sunset commented Mar 28, 2024

This applies PoolingAsyncValueTaskMethodBuilder in a similar manner to what Socket and Stream do.
While it does not improve read performance itself, on low-allocation asynchronous paths I found the async state-machine allocations to be a contributor to GC collection count and pause frequency, so pooling them is profitable.
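For context, the builder is opted into per method via the `[AsyncMethodBuilder]` attribute (C# 10 / .NET 6+). A minimal sketch of the pattern, not the library's actual code (type and parameter names are illustrative):

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public sealed class Utf8StreamReaderSketch
{
    // With the pooling builder, the async state-machine box is taken from
    // a pool and returned on completion instead of being allocated per call.
    // This matters on hot paths that otherwise allocate almost nothing.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public async ValueTask<bool> LoadIntoBufferAsync(Stream source, Memory<byte> buffer)
    {
        int read = await source.ReadAsync(buffer).ConfigureAwait(false);
        return read > 0;
    }
}
```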

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.202
  [Host]     : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2

FromFile before:

| Method                 | Mean     | Error   | StdDev  | Gen0       | Allocated |
|------------------------|---------:|--------:|--------:|-----------:|----------:|
| StreamReader           | 140.5 ms | 1.36 ms | 1.20 ms | 11500.0000 | 186.49 MB |
| Utf8StreamReader       | 174.9 ms | 2.26 ms | 2.12 ms | -          | 2.1 MB    |
| Utf8TextReader         | 192.7 ms | 0.33 ms | 0.30 ms | -          | 2.1 MB    |
| Utf8TextReaderToString | 192.4 ms | 1.06 ms | 0.94 ms | -          | 2.1 MB    |

FromFile after:

| Method                 | Mean     | Error    | StdDev   | Gen0       | Allocated    |
|------------------------|---------:|---------:|---------:|-----------:|-------------:|
| StreamReader           | 138.6 ms | 1.48 ms  | 1.38 ms  | 11666.6667 | 190967.17 KB |
| Utf8StreamReader       | 177.6 ms | 2.83 ms  | 2.64 ms  | -          | 1.16 KB      |
| Utf8TextReader         | 197.8 ms | 0.73 ms  | 0.65 ms  | -          | 1.38 KB      |
| Utf8TextReaderToString | 192.0 ms | 0.63 ms  | 0.56 ms  | -          | 1.2 KB       |
| 😅*                    | 78.87 ms | 1.371 ms | 1.216 ms | 4000.0000  | 69238.38 KB  |

*The main bottleneck for file IO, especially on Windows, is the latency of calls into kernel APIs, and it gets particularly bad with Windows's implementation of Overlapped IO. Unix systems are much better behaved despite a more naive implementation that dispatches a blocking file read as a separate threadpool work item: the latency is far lower, and the approach does not even cause issues because the threadpool copes with blocked workers very well. One way to address this is increasing the starting buffer size to 4KiB or even 8KiB, which U8Reader does. My theory is that modern NVMe drives have fairly large internal memory pages, and filesystems themselves operate on 4, 8, or even 32KiB pages, so a single file read can naturally align and let the OS pull in one or more evenly sliced pages. I don't know whether that is actually why it helps, but increasing the buffer size to 8KiB has shown consistent improvement on all systems in all situations.
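The read loop described above can be sketched with .NET 6's `RandomAccess` API and an 8KiB starting buffer (file name illustrative, error handling omitted):

```csharp
using System.IO;
using Microsoft.Win32.SafeHandles;

// Positional (pread-style) reads through a raw handle, skipping
// FileStream's own buffering and position tracking entirely.
using SafeFileHandle handle = File.OpenHandle("data.txt");
byte[] buffer = new byte[8 * 1024]; // 8KiB starting buffer, as discussed above
long offset = 0;
int read;
while ((read = RandomAccess.Read(handle, buffer, offset)) > 0)
{
    offset += read;
    // ... consume buffer[..read] ...
}
```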

There is also additional cost to async depth even when the ValueTasks are pooled. In fact, it is problematic enough that I had to entirely rewrite ReadToAsyncCore and the various methods forwarding to it in U8Reader in order to make U8SplitReader (effectively a line-reader variant for custom delimiters) profitable to use. It is still somewhat expensive, but at least it gives users a reason to prefer the more memory-friendly `await foreach (var segment in file.AsU8Reader(false).Split(','))` over pre-buffering large text segments and using the normal split iterator.

If the line length in your use case is substantial, or the data arrives in small chunks (like a network stream delimited by lines), this will not help: every call ends up awaiting asynchronously. But if you expect the buffer to hold, say, 5 or more lines, it becomes important to avoid as many awaits as possible even when they complete synchronously (which is also why I'm using RandomAccess).
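The usual shape of that optimization is a synchronous fast path that drains the buffer before touching any async machinery. A hedged sketch, with hypothetical helper names (`TryReadLineFromBuffer`, `FillBufferAsync` are not the library's actual API):

```csharp
using System.Threading.Tasks;

public partial class LineReaderSketch
{
    // Most calls return on the first branch without ever suspending;
    // only a drained buffer pays for the await and the refill.
    public async ValueTask<string?> ReadLineAsync()
    {
        if (TryReadLineFromBuffer(out var line)) // sync fast path
            return line;

        while (await FillBufferAsync().ConfigureAwait(false)) // async slow path
        {
            if (TryReadLineFromBuffer(out line))
                return line;
        }
        return null; // end of stream
    }

    private partial bool TryReadLineFromBuffer(out string? line);
    private partial ValueTask<bool> FillBufferAsync();
}
```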

And also, thank you for your work, your libraries are my go-to recommendation! 😀

@neuecc
Member

neuecc commented Mar 28, 2024

I had been unsure about the pooling attribute myself; in the end I didn't apply it, so it's great to hear your insights on it!

Thank you for the benchmarks as well.
I hadn't measured the performance specifically for File I/O.
I re-measured with different parameters based on your suggestions.

[image: re-measured benchmark results]

Indeed, the current default (NoBuffer_Async) doesn't show very good results.
At the very least, it looks like the buffer size could be larger.

The results of the microbenchmarks make it tricky to decide how to handle useAsync.

@neuecc
Member

neuecc commented Mar 28, 2024

Due to the nature of the implementation, the larger the buffer, the fewer copies and ReadAsync calls, which directly impacts performance.

[image: benchmark results by buffer size]

In serializers like MessagePack and MemoryPack, I've been using 64K buffers obtained from the ArrayPool.
So I'm inclined to think 64K might be a good choice for this as well.
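The standard shape of that pattern (not the serializers' exact code) is renting from the shared pool and returning in a finally block:

```csharp
using System.Buffers;

// ArrayPool<byte>.Shared sizes buffers in power-of-two buckets, so renting
// 64 * 1024 bytes returns an array of at least 65536 bytes that is reused
// across rentals rather than allocated (and collected) fresh each time.
byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
try
{
    // ... fill and consume buffer ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}
```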

@neuecc
Member

neuecc commented Mar 28, 2024

Looking back at the .NET 6 release documentation, I reviewed the following:
https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-6/#io
https://devblogs.microsoft.com/dotnet/file-io-improvements-in-dotnet-6/
https://adamsitnik.com/files/Fast_File_IO_with_DOTNET_6.pdf

Regarding Read, Adam Sitnik's detailed explanation omits the async true/false comparison for a Windows environment!
However, the first article does include that comparison, and it confirms that the results are significantly slower when async is true.
It seems like they didn't want to include a table that would be unfavorable (and would require careful explanation)...

In theory, to summarize:

  • Using async I/O reduces thread pool consumption, increasing scalability
  • Not using async I/O increases thread pool consumption, but it's faster in a Windows environment

I think I want to add an option to change the useAsync argument of Utf8StreamReader.
Also, let's set the buffer size to 64K.
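For reference, what the useAsync argument ultimately controls is whether the underlying file handle is opened for overlapped (asynchronous) I/O. A sketch of the FileStream side of this trade-off (file name illustrative):

```csharp
using System.IO;

// useAsync: false opens a synchronous handle: lower per-call latency on
// Windows, but each read blocks a thread. useAsync: true opens an
// overlapped handle: slower per call, but scales better under load.
using var fs = new FileStream(
    "data.txt", FileMode.Open, FileAccess.Read, FileShare.Read,
    bufferSize: 64 * 1024, // the proposed 64K default
    useAsync: false);
```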

@neuecc neuecc merged commit f66b91f into Cysharp:main Mar 29, 2024
@neon-sunset
Contributor Author

Interesting, I did not expect it to scale so well to such a large buffer size; the effect seems much more pronounced on Windows, too. Reading your notes gave me some food for thought that I can use for yet another rewrite of U8Reader. There's still a lot of performance left on the table (especially around GC-friendly allocation patterns) that could be gained through conditional double-buffering, adapting to read modes on the fly, detecting sync/async modes and then feeding that "strategy" into conditionally re-allocating a handle for sync/async on Windows, and more... though that might be overkill.

Either way, thank you and have a good day :)

@neon-sunset neon-sunset deleted the pool-valuetask branch March 30, 2024 14:17
