Add support for io_uring read method #36103
alexey-milovidov merged 7 commits into ClickHouse:master
|
Fast test is using subset of submodules, the list should be updated at |
|
Let's also compare performance with |
|
We can test this method by settings randomization, see |
|
Ran more tests, added the And in CPU usage: Also ran the same tests a lot more times with a warmed up page cache, here which I suppose makes sense as there's some overhead in keeping track of promises/requests in a hash map. Maybe short-circuiting using |
|
The results are good, so we can even consider making this method the default.
I think 3% is tolerable :) |
|
Updated the SQE submission error handling part. Is there something else missing? Should I worry about the failing tests? |
|
@alexey-milovidov ping? |
|
According to the tests it does not work at all: https://s3.amazonaws.com/clickhouse-test-reports/36103/1142f0c70ff9b9fcba5fa944c0ba222576636e3e/stateful_tests__release__actions_.html |
Maybe it has been run on an old Linux kernel (although it's unlikely because we are using AWS machines). |
|
Previously tests failed because the completion ring would overflow, so I tried increasing it in the last commit, but now it gives I think the kernel version should be fine, as support for |
|
Ok. Let's resolve the conflicts and continue... |
|
Let's continue. |
|
It hangs somewhere... |
|
@sauliusvl io_uring is most likely not instrumented by MSan. In this case, you need to do |
Force-pushed from d192c69 to f17f82d.
…hecking, reduce size of io_uring
…t throw from monitor thread
|
Instrumented code with |
|
OK, so I think the remaining failed tests are unrelated to the PR? One is about a missing S3 multipart upload, one is a flaky test (does not use |
|
Yes, looks ready! |
Known issue, flaky test (large multipart upload and minio).
The log is terminated, most likely the spot instance was killed.
Known issue, fixed in master. |
alexey-milovidov left a comment:
Thank you! This is amazing!
|
Hung Check in Stress Tests started to fail in master after this PR was merged. |
|
well that was short lived :D wonder why tests succeeded for the PR, should I try re-opening it and see if they fail? |
Maybe because Stress Tests are not really deterministic and you were just lucky :) |
|
@sauliusvl Please help to submit this PR again. We want it to be merged 👍 |




Changelog category:
Changelog entry:
Add a new `io_uring` method for `local_filesystem_read_method`, based on the asynchronous Linux io_uring subsystem. It improves read performance almost universally compared to the default `pread` method.

Overview
The reader is implemented as an `IAsynchronousReader`. Each I/O request is enqueued to the submission ring; a separate monitor thread reaps I/O completions from the completion queue and completes the queued futures. System support for `io_uring` is checked upon the first request; reads fail if the system is not Linux or `io_uring` is not available (the minimum kernel version is 5.6).

Performance Benchmarks
All tests were performed using `select count(ignore(*)) from visits` queries. For parallel queries I made 10 copies of the same `visits` table. Tested on my i7-7700K / 32GB desktop with a 7200rpm WD HDD.

To compare raw throughput I ran 1 query at a time ~50 times, clearing the page cache before each run. `io_uring` is statistically significantly faster in all cases except for `O_DIRECT` without prefetch (`min_bytes_to_use_direct_io = 1, local_filesystem_read_prefetch = 0`), where performance is the same but CPU usage is drastically lower:

Next I tried running more queries in parallel, querying 10 identical tables at once, without direct IO and without prefetch, clearing caches before each run and repeating everything multiple times:
Again `io_uring` performs faster on average, with less variability and better percentiles, and the advantage becomes more significant with increasing parallelism. It also appears to be more resource-efficient in terms of CPU usage. Here we measure `cpu_non_io_sec = (OSCPUVirtualTimeMicroseconds + OSCPUWaitMicroseconds) / 1e6` running on 8 cores:

Since `io_uring` is asynchronous, technically there is no CPU I/O wait and we observe `OSIOWaitMicroseconds = 0`. Here is how `htop` looks when running 10 parallel queries using `io_uring` vs. `pread`:

Running the same tests with a warmed up page cache, I was not able to observe any statistically significant differences in either query duration or CPU times.
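
For readers unfamiliar with the pattern described in the Overview (enqueue a request, let a monitor thread reap completions, resolve the matching pending future), here is a small illustrative sketch. It is not the code from this PR and it does not use io_uring at all: a plain in-process queue stands in for the submission and completion rings, and the monitor thread performs the read itself with `os.pread`, whereas with io_uring the kernel performs the read and the monitor thread only reaps the completion. The names `SketchAsyncReader` and `demo` are hypothetical and exist only for this example.

```python
import os
import queue
import tempfile
import threading
from concurrent.futures import Future


class SketchAsyncReader:
    """Toy model of the submit / monitor-thread / future flow (no io_uring involved)."""

    def __init__(self):
        self._pending = {}          # request id -> Future, i.e. the map of in-flight requests
        self._lock = threading.Lock()
        self._next_id = 0
        self._ring = queue.Queue()  # assumption: stands in for the submission/completion rings
        self._monitor = threading.Thread(target=self._monitor_loop, daemon=True)
        self._monitor.start()

    def submit(self, fd, size, offset):
        """Register a pending future, enqueue the request, and return immediately."""
        future = Future()
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
            self._pending[request_id] = future
        self._ring.put((request_id, fd, size, offset))
        return future

    def _monitor_loop(self):
        """Take one entry at a time and resolve the future registered for it."""
        while True:
            item = self._ring.get()
            if item is None:        # shutdown marker pushed by close()
                return
            request_id, fd, size, offset = item
            with self._lock:
                future = self._pending.pop(request_id)
            try:
                # In this toy the read happens here; with io_uring the kernel
                # would already have done it before the completion is reaped.
                future.set_result(os.pread(fd, size, offset))
            except OSError as error:
                # Report failures through the future; never raise in the monitor thread.
                future.set_exception(error)

    def close(self):
        self._ring.put(None)
        self._monitor.join()


def demo():
    """Submit two reads against a temporary file and wait for both futures."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello io_uring")
        f.flush()
        reader = SketchAsyncReader()
        first = reader.submit(f.fileno(), 5, 0)
        second = reader.submit(f.fileno(), 8, 6)
        result = (first.result(timeout=5), second.result(timeout=5))
        reader.close()
        return result
```

Calling `demo()` submits both reads before waiting on either, which is the point of the asynchronous interface. The per-request bookkeeping in `_pending` is the kind of overhead mentioned earlier in the conversation for the warm page cache case.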
Resolves #10787