
Add support for io_uring read method #36103

Merged: alexey-milovidov merged 7 commits into ClickHouse:master from sauliusvl:uring, Jun 23, 2022

Conversation

@sauliusvl (Contributor) commented Apr 10, 2022

Changelog category:

  • Performance Improvement

Changelog entry:

Add a new local_filesystem_read_method value, io_uring, based on the asynchronous Linux io_uring subsystem; it improves read performance almost universally compared to the default pread method.

Overview

The reader is implemented as an IAsynchronousReader: each I/O request is enqueued to the submission ring, and a separate monitor thread reaps completions from the completion queue and fulfills the queued futures. System support for io_uring is checked upon the first request; reads fail if not on Linux or if io_uring is unavailable (the minimum kernel version is 5.6).
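The availability check described above can be illustrated with a small probe. The sketch below is hypothetical Python, not the PR's actual C++ code: it attempts to create a tiny ring via the raw io_uring_setup syscall (number 425 on both x86_64 and aarch64) and reports whether the kernel supports it.

```python
import ctypes
import os

def io_uring_available(entries: int = 4) -> bool:
    """Probe kernel io_uring support by creating (and closing) a tiny ring."""
    libc = ctypes.CDLL(None, use_errno=True)
    SYS_io_uring_setup = 425  # same syscall number on x86_64 and aarch64
    # Zeroed stand-in for struct io_uring_params (120 bytes on current kernels).
    params = ctypes.create_string_buffer(120)
    fd = libc.syscall(SYS_io_uring_setup, entries, params)
    if fd >= 0:
        os.close(fd)
        return True
    # ENOSYS on kernels without io_uring; any other error also means "unusable here".
    return False

print("io_uring available:", io_uring_available())
```

The reader itself is then selected with the new setting, local_filesystem_read_method = 'io_uring', in the query or profile settings.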

Performance Benchmarks

All tests were performed using select count(ignore(*)) from visits queries. For parallel queries I made 10 copies of the same visits table. Tested on my i7-7700K / 32 GB desktop with a 7200 rpm WD HDD.

To compare raw throughput I ran 1 query at a time ~50 times, clearing the page cache before each run. io_uring is statistically significantly faster in all cases except O_DIRECT without prefetch (min_bytes_to_use_direct_io = 1, local_filesystem_read_prefetch = 0), where performance is the same but CPU usage is drastically lower:

+----------------------+---------------+---------------+-------------+-------------+-----------+
| direct_io / prefetch |     pread     |   io_uring    | improvement | significant | cpu_usage |
+----------------------+---------------+---------------+-------------+-------------+-----------+
| no  / no             | 10.75 ± 0.47s |  9.91 ± 0.49s |       7.75% |         yes |    -1.29% |
| no  / yes            |  8.86 ± 0.23s |  7.85 ± 0.31s |      11.48% |         yes |    -2.86% |
| yes / yes            | 14.41 ± 0.57s | 11.49 ± 0.28s |      20.27% |         yes |    -1.74% |
| yes / no             | 14.37 ± 0.50s | 14.34 ± 0.47s |       0.25% |          no |   -24.52% |
+----------------------+---------------+---------------+-------------+-------------+-----------+
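For context on how the improvement and significance columns of such a table can be derived from raw per-run timings, here is a hypothetical Python helper (not the author's actual analysis script) that summarizes two samples and flags significance with a simple Welch t-statistic threshold:

```python
import math
import statistics

def compare(pread_times, uring_times, t_threshold=2.0):
    """Summarize two timing samples as mean +/- stdev and % improvement,
    with a crude significance flag based on Welch's t-statistic."""
    m1, m2 = statistics.mean(pread_times), statistics.mean(uring_times)
    s1, s2 = statistics.stdev(pread_times), statistics.stdev(uring_times)
    n1, n2 = len(pread_times), len(uring_times)
    t = (m1 - m2) / math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
    return {
        "pread": f"{m1:.2f} ± {s1:.2f}s",
        "io_uring": f"{m2:.2f} ± {s2:.2f}s",
        "improvement_pct": (m1 - m2) / m1 * 100,
        "significant": abs(t) > t_threshold,
    }
```

With ~50 runs per configuration, even a sub-10% mean difference clears the threshold when the per-run variance is as small as in the table above.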

Next I tried running more queries in parallel, querying 10 identical tables at once, without direct IO and without prefetch, clearing caches before each run and repeating everything multiple times:

[image: query duration chart, io_uring vs. pread at increasing parallelism]

Again io_uring is faster on average, with less variability and better percentiles, and the advantage becomes more significant with increasing parallelism. It also appears to be more efficient in terms of CPU usage; here we measure cpu_non_io_sec = (OSCPUVirtualTimeMicroseconds + OSCPUWaitMicroseconds) / 10e6, running on 8 cores:

[image: cpu_non_io_sec chart, io_uring vs. pread]

Since io_uring is asynchronous, technically there is no CPU I/O wait and we observe OSIOWaitMicroseconds = 0. Here's what htop looks like when running 10 parallel queries using io_uring vs. pread:

[image: htop screenshots, 10 parallel queries with io_uring vs. pread]

Running the same tests with a warmed-up page cache, I was not able to observe any statistically significant differences in either query duration or CPU time.

Resolves #10787

@CLAassistant commented Apr 10, 2022

CLA assistant check
All committers have signed the CLA.

@robot-ch-test-poll added the pr-performance (Pull request with some performance improvements) and submodule changed (At least one submodule changed in this PR) labels Apr 10, 2022
@nickitat added the can be tested (Allows running workflows for external contributors) label Apr 10, 2022
@alexey-milovidov (Member)

The fast test uses a subset of submodules; the list should be updated in docker/fasttest/run.sh.

@alexey-milovidov (Member)

Let's also compare performance with pread_threadpool, for both cold and cached reads.

@alexey-milovidov self-assigned this Apr 11, 2022
@alexey-milovidov (Member)

We can test this method via settings randomization; see tests/clickhouse-test, class SettingsRandomizer.

@sauliusvl (Contributor, author)

I ran more tests and added the pread_threadpool method to the suite; here's how it compares in terms of query duration:

[image: query duration chart including pread_threadpool]

And in CPU usage:

[image: CPU usage chart including pread_threadpool]

I also ran the same tests many more times with a warmed-up page cache; here io_uring seems to be slower by ~3%. I observed similar results when querying the same data on an NVMe SSD:

[image: warm page cache query duration chart]

which I suppose makes sense, as there's some overhead in keeping track of promises/requests in a hash map. Maybe short-circuiting using preadv2 is an option here, similar to the pread_threadpool method. CPU usage appears to be lower for io_uring, though:

[image: CPU usage chart with warm page cache]
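The preadv2 short-circuit idea can be sketched as follows; this is a hypothetical Python illustration (the real implementation would be C++), using a non-blocking RWF_NOWAIT read to serve page-cache hits synchronously and falling back to the asynchronous path (async_submit is an assumed callback) only for cold data:

```python
import os

def read_short_circuit(fd, size, offset, async_submit):
    # Try a non-blocking read first: with RWF_NOWAIT the kernel returns the
    # data only if it is already in the page cache, otherwise EAGAIN.
    flags = getattr(os, "RWF_NOWAIT", 0)  # Linux >= 4.14; 0 degrades to a plain preadv
    buf = bytearray(size)
    try:
        n = os.preadv(fd, [buf], offset, flags)
        return bytes(buf[:n])              # page-cache hit: completed synchronously
    except BlockingIOError:
        return async_submit(size, offset)  # cold data: go through the io_uring path
```

This avoids the promise/hash-map bookkeeping entirely for cached reads, which is where the ~3% regression was observed.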

@alexey-milovidov (Member)

The results are good, so we can even consider making this method the default.

warmed up page cache, here io_uring seems to be slower by ~3%

I think 3% is tolerable :)

@sauliusvl (Contributor, author)

Updated the SQE submission error handling. Is there anything else missing? Should I worry about the failing tests?

@sauliusvl (Contributor, author)

@alexey-milovidov ping?

@alexey-milovidov (Member)

DB::ErrnoException: Failed initializing io_uring, got -12, errno: 0, strerror: Success

Maybe it has been run on an old Linux kernel (although it's unlikely because we are using AWS machines).
The diagnostics like "-12, not zero, success" look misleading.

@sauliusvl (Contributor, author)

Previously the tests failed because the completion ring would overflow, so I tried increasing it in the last commit, but now initialization fails with -ENOMEM. I guess blindly increasing the ring size is not the proper solution anyway; I'll try working around the issue by monitoring the CQ size and delaying submission when it's full.

I think the kernel version should be fine, as support for io_uring is checked explicitly in the constructor, though I guess the CI kernel is a bit older? I can't reproduce any of these errors on my 5.17 kernel, no matter how hard I try.
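The delayed-submission idea boils down to a bound on in-flight requests: never have more outstanding reads than completion-queue entries, so the CQ cannot overflow. A minimal illustration (hypothetical Python, not the actual C++ fix in this PR):

```python
import threading

class BoundedSubmitter:
    """Cap in-flight requests at the completion-queue size so the CQ
    cannot overflow; submit() blocks until a completion frees a slot."""
    def __init__(self, cq_entries: int):
        self._slots = threading.Semaphore(cq_entries)

    def submit(self, enqueue):
        self._slots.acquire()  # wait for a free completion-queue slot
        enqueue()              # e.g. push an SQE and submit it to the ring

    def complete(self):
        self._slots.release()  # the monitor thread reaped one completion
```

A semaphore sized to the CQ gives backpressure for free: submitters simply park until the reaper catches up, instead of the ring silently dropping or overflowing completions.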

@alexey-milovidov (Member)

Ok. Let's resolve the conflicts and continue...

@alexey-milovidov (Member)

Let's continue.

@sauliusvl (Contributor, author)

OK, so all the obvious errors are gone. Now there are quite a few "Timeout dropping database xxx after test" failures due to 620s timeouts, and also a few sanitizer trap failures (e.g. this and this). I'll try to understand the reasons for them, but would also appreciate any additional review.

@alexey-milovidov (Member)

It hangs somewhere...

@alexey-milovidov (Member)

@sauliusvl io_uring is most likely not instrumented by MSan. In this case, you need to do __msan_unpoison, see the examples in the code.

@sauliusvl force-pushed the uring branch 2 times, most recently from d192c69 to f17f82d, June 20, 2022 19:47
@sauliusvl (Contributor, author)

Instrumented the code with __msan_unpoison and fixed the timeout issues: it turns out the io_uring_enter syscall can sometimes be interrupted and return -EINTR. Previously the monitor thread would throw and all subsequent requests would never complete; now it just retries.
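The EINTR fix described above is a standard retry loop around the wait call. A minimal sketch (hypothetical Python stand-in for the C++ monitor-thread code, where enter is whatever wraps io_uring_enter):

```python
import errno

def wait_completions_retrying(enter):
    """Call the (stand-in) io_uring_enter wrapper, retrying on -EINTR:
    an interrupted syscall is benign and must not tear down the monitor
    thread, or queued requests would never complete."""
    while True:
        ret = enter()
        if ret == -errno.EINTR:
            continue  # interrupted by a signal: just retry
        return ret
```

Treating -EINTR as retryable rather than fatal is what keeps the monitor thread alive across signal deliveries.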

@sauliusvl (Contributor, author)

OK, so I think the remaining failed tests are unrelated to this PR? One is about a missing S3 multipart upload, one is a flaky test (it does not use io_uring), and one is "Failed to parse the report."; not sure what's causing that.

@alexey-milovidov (Member)

Yes, looks ready!
I will check the remaining checks...

@alexey-milovidov (Member)

test_storage_s3/test.py::test_seekable_formats_url

Known issue, flaky test (large multipart upload and minio).

Performance Comparison [2/4] — Failed to parse the report.

The log is truncated; most likely the spot instance was killed.

02305_schema_inference_with_globs

Known issue, fixed in master.

@alexey-milovidov (Member) left a comment

Thank you! This is amazing!

@alexey-milovidov merged commit 812ab9b into ClickHouse:master, Jun 23, 2022
@sauliusvl deleted the uring branch June 24, 2022 06:06
@tavplubix (Member)

Hung Check in Stress Tests started to fail in master after this PR was merged.
Example: https://s3.amazonaws.com/clickhouse-test-reports/0/774de099cbd4090113344501286dca5463e85d50/stress_test__memory__actions_.html
There are stacktraces that look related to this PR: https://pastila.nl/?005c9b27/9db6edb26f235dca916e7499fc6fd0d3
I will try to revert it.

@sauliusvl (Contributor, author)

Well, that was short-lived :D I wonder why the tests succeeded for the PR. Should I try re-opening it to see if they fail?

@tavplubix (Member) commented Jun 24, 2022

wonder why tests succeeded for the PR

Maybe because Stress Tests are not really deterministic and you were just lucky :)
The revert fixed the failures in master, so I'm sure it was related.

@alexey-milovidov (Member)

@sauliusvl Please help submit this PR again. We want it to be merged 👍
