Use PoolingAsyncValueTaskMethodBuilder for LoadIntoBufferAsync #2
neuecc merged 1 commit into Cysharp:main
I had a problem regarding the Pool attribute. Thank you for the benchmarks as well. Indeed, the current default (NoBuffer_Async) doesn't show very good results. The results of the microbenchmarks make it tricky to decide how to handle useAsync.
Due to the nature of the implementation, the larger the buffer, the fewer copies and ReadAsync calls, which directly impacts performance. In serializers like MessagePack and MemoryPack, I've been using 64K buffers obtained from the ArrayPool.
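As a hedged sketch of the 64K `ArrayPool` pattern mentioned above (the surrounding method and the `stream` variable are illustrative, not the serializers' actual code):

```csharp
using System.Buffers;

static async Task ProcessAsync(Stream stream)
{
    // Rent a 64K buffer from the shared pool; the returned array
    // may be larger than requested, so track the filled length yourself.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(65536);
    try
    {
        int read = await stream.ReadAsync(buffer.AsMemory()).ConfigureAwait(false);
        // ... process buffer[0..read] ...
    }
    finally
    {
        // Returning the buffer lets subsequent reads reuse it,
        // avoiding repeated large-object allocations.
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
```

The try/finally matters: a buffer that is rented but never returned simply falls out of the pool, silently losing the reuse benefit.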
Looking back at the .NET 6 release documentation, I reviewed the following: in Adam Sitnik's detailed explanation of Read, the async true/false comparison for a Windows environment is omitted! I think I want to add an option to change the useAsync argument of Utf8StreamReader.
Interesting, I did not expect it to scale so well to such a large buffer size. It seems to be much more pronounced on Windows too. Reading your notes gave me some food for thought that I can use for yet another rewrite. Either way, thank you and have a good day :)


This applies `PoolingAsyncValueTaskMethodBuilder` in a similar manner to what `Socket` and `Stream` do. While it does not improve read performance itself, on low-allocation asynchronous paths I found it to be a contributor to GC collection and pause frequency, making the change profitable.
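For readers unfamiliar with the pattern, here is a minimal sketch of opting a single async method into the pooling builder; the type and member names (`BufferedReader`, `ReadChunkAsync`) are placeholders, not the PR's actual code:

```csharp
using System.Runtime.CompilerServices;

public sealed class BufferedReader
{
    private readonly Stream _stream;
    private readonly byte[] _buffer = new byte[8192];

    public BufferedReader(Stream stream) => _stream = stream;

    // The attribute replaces the default ValueTask state-machine box with
    // a pooled one, so frequently awaited asynchronous completions stop
    // allocating a fresh box per call. This is the same opt-in mechanism
    // Socket and Stream use internally.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public async ValueTask<int> ReadChunkAsync(CancellationToken ct = default)
    {
        return await _stream.ReadAsync(_buffer.AsMemory(), ct).ConfigureAwait(false);
    }
}
```

Note the usual ValueTask caveat still applies: a pooled `ValueTask<T>` must be awaited exactly once, since the box is recycled after consumption.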
`FromFile` before: (benchmark results omitted)

`FromFile` after: (benchmark results omitted)

*The main bottleneck for file IO, especially on Windows, is the latency of calls into kernel APIs. Sometimes it gets particularly bad with Windows's implementation of Overlapped. In fact, on Unix systems it is much better behaved despite a more naive-ish implementation that performs the blocking file read as a separate threadpool work item: the latency is way lower, and this approach does not even cause issues thanks to the threadpool's ability to cope with blocked workers really well. One way to address this is increasing the starting buffer size to 4 KiB or even 8 KiB, which U8Reader does. My theory here is that modern NVMe drives have pretty large memory pages internally, and filesystems themselves operate on 4, 8 or even 32 KiB pages. This way a single file read could naturally align and let the OS pull in one or multiple evenly sliced pages. I don't know whether this is in fact what helps, but increasing the buffer size up to 8 KiB has shown consistent improvement on all systems in all situations.
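A rough sketch of the read pattern this describes, combining `RandomAccess` with an 8 KiB starting buffer; this is illustrative only, not U8Reader's implementation:

```csharp
using Microsoft.Win32.SafeHandles;

// Open a handle directly rather than going through FileStream,
// so each read is a single RandomAccess call against the kernel.
using SafeFileHandle handle = File.OpenHandle("data.txt");

// 8 KiB tends to line up with common filesystem page sizes,
// per the theory above.
byte[] buffer = new byte[8192];
long offset = 0;
int read;
while ((read = RandomAccess.Read(handle, buffer, offset)) > 0)
{
    offset += read;
    // ... process buffer[0..read] ...
}
```

`RandomAccess.Read` takes an explicit file offset instead of maintaining stream position, which also makes the synchronous fast path cheap to hit repeatedly.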
There is also an additional cost to async depth even when ValueTasks are pooled. In fact, it is so problematic that I had to rewrite `ReadToAsyncCore` and the various methods forwarding to it in U8Reader entirely in order to make `U8SplitReader` (effectively a line-reader variant for custom delimiters) profitable to use. It is still expensive-ish, but at least it gives users a reason to prefer the more memory-friendly `await foreach (var segment in file.AsU8Reader(false).Split(','))` over pre-buffering large text segments and using the normal split iterator.

If the line length in your use case is substantial, or the data becomes available in small chunks, like a network stream delimited by lines, this will not help: all calls end up doing the asynchronous await. But if you expect the buffer to hold, let's say, 5 or more lines, it becomes important to avoid as many awaits as possible even when they would complete synchronously (which is why I'm also using `RandomAccess`).

And also, thank you for your work, your libraries are my go-to recommendation! 😀
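The "many lines per buffer" point can be sketched as follows: await once per fill, then yield every complete line already in the buffer without touching the async machinery again. All names here are illustrative, and the sketch assumes lines fit within the 8 KiB buffer (no growth logic), unlike a real reader:

```csharp
using System.Text;

static async IAsyncEnumerable<string> ReadLinesAsync(Stream stream)
{
    var buffer = new byte[8192];
    int filled = 0;
    while (true)
    {
        int read = await stream.ReadAsync(buffer.AsMemory(filled)).ConfigureAwait(false);
        if (read == 0) break;
        filled += read;

        // Drain every complete line currently buffered. If the buffer holds
        // 5+ lines, one asynchronous read serves many synchronous yields.
        int start = 0, nl;
        while ((nl = Array.IndexOf(buffer, (byte)'\n', start, filled - start)) >= 0)
        {
            yield return Encoding.UTF8.GetString(buffer, start, nl - start);
            start = nl + 1;
        }

        // Shift the partial tail line to the front for the next fill.
        Buffer.BlockCopy(buffer, start, buffer, 0, filled - start);
        filled -= start;
    }
    if (filled > 0) yield return Encoding.UTF8.GetString(buffer, 0, filled);
}
```

The inverse shape, awaiting once per line, pays state-machine and scheduling overhead on every iteration even when the data is already in memory, which is the cost the comment above is describing.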