
Add support for io_uring read method #36103

Merged: alexey-milovidov merged 7 commits into ClickHouse:master from sauliusvl:uring, Jun 23, 2022

Conversation

@sauliusvl (Contributor) commented Apr 10, 2022

Changelog category:

  • Performance Improvement

Changelog entry:

Add a new local_filesystem_read_method value, io_uring, based on the asynchronous Linux io_uring subsystem; it improves read performance almost universally compared to the default pread method.

Overview

The reader is implemented as an IAsynchronousReader: each I/O request is enqueued to the submission ring, and a separate monitor thread reaps completions from the completion queue and fulfills the queued futures. System support for io_uring is checked upon the first request; reads fail if not on Linux or if io_uring is unavailable (the minimum kernel version is 5.6).
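The availability check described above can be illustrated with a small probe. The sketch below is hypothetical Python, not the PR's actual C++ code: it attempts to create a tiny ring via the raw io_uring_setup syscall (number 425 on both x86_64 and aarch64) and reports whether the kernel supports it.

```python
import ctypes
import os

def io_uring_available(entries: int = 4) -> bool:
    """Probe kernel io_uring support by creating (and closing) a tiny ring."""
    libc = ctypes.CDLL(None, use_errno=True)
    SYS_io_uring_setup = 425  # same syscall number on x86_64 and aarch64
    # Zeroed stand-in for struct io_uring_params (120 bytes on current kernels).
    params = ctypes.create_string_buffer(120)
    fd = libc.syscall(SYS_io_uring_setup, entries, params)
    if fd >= 0:
        os.close(fd)
        return True
    # ENOSYS on kernels without io_uring; any other error also means "unusable here".
    return False

print("io_uring available:", io_uring_available())
```

The reader itself is then selected with the new setting, local_filesystem_read_method = 'io_uring', in the query or profile settings.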

Performance Benchmarks

All tests were performed using select count(ignore(*)) from visits queries. For parallel queries I made 10 copies of the same visits table. Tested on my i7-7700K / 32 GB desktop with a 7200 rpm WD HDD.

To compare raw throughput I ran 1 query at a time ~50 times, clearing the page cache before each run. io_uring is statistically significantly faster in all cases except O_DIRECT without prefetch (min_bytes_to_use_direct_io = 1, local_filesystem_read_prefetch = 0), where performance is the same but CPU usage is drastically lower:

+----------------------+---------------+---------------+-------------+-------------+-----------+
| direct_io / prefetch |     pread     |   io_uring    | improvement | significant | cpu_usage |
+----------------------+---------------+---------------+-------------+-------------+-----------+
| no  / no             | 10.75 ± 0.47s |  9.91 ± 0.49s |       7.75% |         yes |    -1.29% |
| no  / yes            |  8.86 ± 0.23s |  7.85 ± 0.31s |      11.48% |         yes |    -2.86% |
| yes / yes            | 14.41 ± 0.57s | 11.49 ± 0.28s |      20.27% |         yes |    -1.74% |
| yes / no             | 14.37 ± 0.50s | 14.34 ± 0.47s |       0.25% |          no |   -24.52% |
+----------------------+---------------+---------------+-------------+-------------+-----------+
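For context on how the improvement and significance columns of such a table can be derived from raw per-run timings, here is a hypothetical Python helper (not the author's actual analysis script) that summarizes two samples and flags significance with a simple Welch t-statistic threshold:

```python
import math
import statistics

def compare(pread_times, uring_times, t_threshold=2.0):
    """Summarize two timing samples as mean +/- stdev and % improvement,
    with a crude significance flag based on Welch's t-statistic."""
    m1, m2 = statistics.mean(pread_times), statistics.mean(uring_times)
    s1, s2 = statistics.stdev(pread_times), statistics.stdev(uring_times)
    n1, n2 = len(pread_times), len(uring_times)
    t = (m1 - m2) / math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
    return {
        "pread": f"{m1:.2f} ± {s1:.2f}s",
        "io_uring": f"{m2:.2f} ± {s2:.2f}s",
        "improvement_pct": (m1 - m2) / m1 * 100,
        "significant": abs(t) > t_threshold,
    }
```

With ~50 runs per configuration, even a sub-10% mean difference clears the threshold when the per-run variance is as small as in the table above.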

Next I tried running more queries in parallel, querying 10 identical tables at once, without direct IO and without prefetch, clearing caches before each run and repeating everything multiple times:

[image: query duration chart, io_uring vs. pread at increasing parallelism]

Again io_uring is faster on average, with less variability and better percentiles, and the advantage becomes more significant with increasing parallelism. It also appears to be more efficient in terms of CPU usage; here we measure cpu_non_io_sec = (OSCPUVirtualTimeMicroseconds + OSCPUWaitMicroseconds) / 10e6, running on 8 cores:

[image: cpu_non_io_sec chart, io_uring vs. pread]

Since io_uring is asynchronous, technically there is no CPU I/O wait and we observe OSIOWaitMicroseconds = 0. Here's what htop looks like when running 10 parallel queries using io_uring vs. pread:

[image: htop screenshots, 10 parallel queries with io_uring vs. pread]

Running the same tests with a warmed-up page cache, I was not able to observe any statistically significant differences in either query duration or CPU time.

Resolves #10787

@CLAassistant commented Apr 10, 2022

CLA assistant check
All committers have signed the CLA.

@robot-ch-test-poll added the pr-performance (Pull request with some performance improvements) and submodule changed (At least one submodule changed in this PR) labels Apr 10, 2022
@nickitat added the can be tested (Allows running workflows for external contributors) label Apr 10, 2022
@alexey-milovidov (Member)

The fast test uses a subset of submodules; the list should be updated in docker/fasttest/run.sh.

@alexey-milovidov (Member)

Let's also compare performance with pread_threadpool, for both cold and cached reads.

@alexey-milovidov self-assigned this Apr 11, 2022
@alexey-milovidov (Member)

We can test this method via settings randomization; see tests/clickhouse-test, class SettingsRandomizer.

@sauliusvl (Contributor, author)

I ran more tests and added the pread_threadpool method to the suite; here's how it compares in terms of query duration:

[image: query duration chart including pread_threadpool]

And in CPU usage:

[image: CPU usage chart including pread_threadpool]

I also ran the same tests many more times with a warmed-up page cache; here io_uring seems to be slower by ~3%. I observed similar results when querying the same data on an NVMe SSD:

[image: warm page cache query duration chart]

which I suppose makes sense, as there's some overhead in keeping track of promises/requests in a hash map. Maybe short-circuiting using preadv2 is an option here, similar to the pread_threadpool method. CPU usage appears to be lower for io_uring, though:

[image: CPU usage chart with warm page cache]
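The preadv2 short-circuit idea can be sketched as follows; this is a hypothetical Python illustration (the real implementation would be C++), using a non-blocking RWF_NOWAIT read to serve page-cache hits synchronously and falling back to the asynchronous path (async_submit is an assumed callback) only for cold data:

```python
import os

def read_short_circuit(fd, size, offset, async_submit):
    # Try a non-blocking read first: with RWF_NOWAIT the kernel returns the
    # data only if it is already in the page cache, otherwise EAGAIN.
    flags = getattr(os, "RWF_NOWAIT", 0)  # Linux >= 4.14; 0 degrades to a plain preadv
    buf = bytearray(size)
    try:
        n = os.preadv(fd, [buf], offset, flags)
        return bytes(buf[:n])              # page-cache hit: completed synchronously
    except BlockingIOError:
        return async_submit(size, offset)  # cold data: go through the io_uring path
```

This avoids the promise/hash-map bookkeeping entirely for cached reads, which is where the ~3% regression was observed.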

@alexey-milovidov (Member)

The results are good, so we can even consider making this method the default.

warmed up page cache, here io_uring seems to be slower by ~3%

I think 3% is tolerable :)

@sauliusvl (Contributor, author)

Updated the SQE submission error handling. Is there anything else missing? Should I worry about the failing tests?

@sauliusvl (Contributor, author)

@alexey-milovidov ping?

@alexey-milovidov (Member)

DB::ErrnoException: Failed initializing io_uring, got -12, errno: 0, strerror: Success

Maybe it has been run on an old Linux kernel (although it's unlikely because we are using AWS machines).
The diagnostics like "-12, not zero, success" look misleading.

@sauliusvl (Contributor, author)

Previously the tests failed because the completion ring would overflow, so I tried increasing it in the last commit, but now initialization fails with -ENOMEM. I guess blindly increasing the ring size is not the proper solution anyway; I'll try working around the issue by monitoring the CQ size and delaying submission when it's full.

I think the kernel version should be fine, as support for io_uring is checked explicitly in the constructor, though I guess the CI kernel is a bit older? I can't reproduce any of these errors on my 5.17 kernel, no matter how hard I try.
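The delayed-submission idea boils down to a bound on in-flight requests: never have more outstanding reads than completion-queue entries, so the CQ cannot overflow. A minimal illustration (hypothetical Python, not the actual C++ fix in this PR):

```python
import threading

class BoundedSubmitter:
    """Cap in-flight requests at the completion-queue size so the CQ
    cannot overflow; submit() blocks until a completion frees a slot."""
    def __init__(self, cq_entries: int):
        self._slots = threading.Semaphore(cq_entries)

    def submit(self, enqueue):
        self._slots.acquire()  # wait for a free completion-queue slot
        enqueue()              # e.g. push an SQE and submit it to the ring

    def complete(self):
        self._slots.release()  # the monitor thread reaped one completion
```

A semaphore sized to the CQ gives backpressure for free: submitters simply park until the reaper catches up, instead of the ring silently dropping or overflowing completions.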

@alexey-milovidov (Member)

Ok. Let's resolve the conflicts and continue...

@alexey-milovidov (Member)

Let's continue.

@sauliusvl (Contributor, author)

OK, so all the obvious errors are gone. Now there are quite a few "Timeout dropping database xxx after test" failures due to 620s timeouts, and also a few sanitizer trap failures (e.g. this and this). I'll try to understand the reasons for them, but would also appreciate any additional review.

@alexey-milovidov (Member)

It hangs somewhere...

@alexey-milovidov (Member)

@sauliusvl io_uring is most likely not instrumented by MSan. In this case, you need to do __msan_unpoison, see the examples in the code.

@sauliusvl force-pushed the uring branch 2 times, most recently from d192c69 to f17f82d, June 20, 2022 19:47
@sauliusvl (Contributor, author)

Instrumented the code with __msan_unpoison and fixed the timeout issues: it turns out the io_uring_enter syscall can sometimes be interrupted and return -EINTR. Previously the monitor thread would throw and all subsequent requests would never complete; now it just retries.
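The EINTR fix described above is a standard retry loop around the wait call. A minimal sketch (hypothetical Python stand-in for the C++ monitor-thread code, where enter is whatever wraps io_uring_enter):

```python
import errno

def wait_completions_retrying(enter):
    """Call the (stand-in) io_uring_enter wrapper, retrying on -EINTR:
    an interrupted syscall is benign and must not tear down the monitor
    thread, or queued requests would never complete."""
    while True:
        ret = enter()
        if ret == -errno.EINTR:
            continue  # interrupted by a signal: just retry
        return ret
```

Treating -EINTR as retryable rather than fatal is what keeps the monitor thread alive across signal deliveries.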

@sauliusvl (Contributor, author)

OK, so I think the remaining failed tests are unrelated to this PR? One is about a missing S3 multipart upload, one is a flaky test (it does not use io_uring), and one is "Failed to parse the report."; not sure what's causing that.

@alexey-milovidov (Member)

Yes, looks ready!
I will check the remaining checks...

@alexey-milovidov (Member)

test_storage_s3/test.py::test_seekable_formats_url

Known issue, flaky test (large multipart upload and minio).

Performance Comparison [2/4] — Failed to parse the report.

The log is truncated; most likely the spot instance was killed.

02305_schema_inference_with_globs

Known issue, fixed in master.

@alexey-milovidov (Member) left a comment

Thank you! This is amazing!

@alexey-milovidov merged commit 812ab9b into ClickHouse:master, Jun 23, 2022
@sauliusvl deleted the uring branch June 24, 2022 06:06
@tavplubix (Member)

Hung Check in Stress Tests started to fail in master after this PR was merged.
Example: https://s3.amazonaws.com/clickhouse-test-reports/0/774de099cbd4090113344501286dca5463e85d50/stress_test__memory__actions_.html
There are stacktraces that look related to this PR: https://pastila.nl/?005c9b27/9db6edb26f235dca916e7499fc6fd0d3
I will try to revert it.

@sauliusvl (Contributor, author)

Well, that was short-lived :D I wonder why the tests succeeded for the PR. Should I try re-opening it to see if they fail?

@tavplubix (Member) commented Jun 24, 2022

wonder why tests succeeded for the PR

Maybe because Stress Tests are not really deterministic and you were just lucky :)
The revert fixed the failures in master, so I'm sure it was related.

@alexey-milovidov (Member)

@sauliusvl Please help submit this PR again. We want it to be merged 👍
