IO scheduling on HTTP session level #65182

Merged
serxa merged 28 commits into master from s3-streams-scheduler on Sep 2, 2024
Conversation

@serxa
Member

@serxa serxa commented Jun 12, 2024

Fixes #63213

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

IO scheduling for remote S3 disks is now done at the level of HTTP socket streams (instead of whole S3 requests) to resolve bandwidth_limit throttling issues.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: Normal Builds
  • Allow: Special Builds
  • Allow: All NOT Required Checks
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-improvement Pull request with some product improvements label Jun 12, 2024
@robot-ch-test-poll2
Contributor

robot-ch-test-poll2 commented Jun 12, 2024

This is an automated comment for commit c6aa12f with a description of existing statuses. It is updated for the latest CI run.


Successful checks
Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test failed at least once, or ran too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success
Install packages | Checks that the built packages are installable in a clean environment | ✅ success
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from this PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ✅ success

@CheSema CheSema self-assigned this Jun 12, 2024
@serxa serxa marked this pull request as ready for review June 14, 2024 16:14
Member Author

It is just spaces because I ended up not changing this file. We can revert the changes here if necessary.

};

/// Scoped attach/detach of IO resource links
struct IOScope : private boost::noncopyable
Member Author

I think we will use this scope eventually for all IO operations including local, hence the name.

Member

Do we have cases with only one type of resource attached? Or are they always used together?

Member Author

For local IO it could be only read or write.

}

ResourceLink link;
ResourceGuard::Request request;
Member Author

Technically we could have used a TLS request, as ResourceGuard does, and avoided any allocations, but storing it here feels safer and more straightforward to me.

Comment on lines 37 to 40

// Adjust budget to account for extra consumption of `cost` resource units
void consumeBudget(ResourceCost cost)
{
adjustBudget(0, cost);
}

// Adjust budget to account for requested, but not consumed `cost` resource units
void accumulateBudget(ResourceCost cost)
{
adjustBudget(cost, 0);
}

/// Enqueue new request to be executed using underlying resource.
Member Author

It is not needed anymore; this logic is now integrated into ResourceGuard.

chassert(state == Dequeued);
state = Finished;
if (estimated_cost != real_cost_)
link_.queue->adjustBudget(estimated_cost, real_cost_);
Member Author

There was a bug. All ResourceGuard instances were used in the wrong way: they passed request->cost into adjustBudget(), but enqueueRequestUsingBudget() changes request->cost before enqueuing the request, so the correct initial estimated cost was not stored anywhere, even though it is required for correct adjustment of the budget. Now we store it in ResourceGuard, and its usage is much simpler and less error-prone.

@serxa serxa mentioned this pull request Jun 20, 2024

ResourceCost cost = request.GetContentLength();
ResourceGuard rlock(write_settings.resource_link, cost);
CurrentThread::IOScope io_scope(write_settings.io_scheduling);
Member

Here you are controlling only the particular current thread with this exact call. In other words, you enable scheduling only for PUT (single object + multipart) and GET (get object).
You also manually set resource guards for Azure and HDFS.

But what about the rest? I expected that you would also cover other HTTP storages here, with other types of requests.

Member

BTW, do you intend to control communication with ClickHouse disks only, or with (S3|Azure|Web)Functions as well?

Member Author

@serxa serxa Jul 5, 2024

Yes, it should cover all remote and local IO eventually. In this PR I'm just trying to improve what we already have. Let's also do the following in a separate PR:

  1. move CurrentThread::IOScope into ObjectStorage
  2. make IOScope inheritable when spawning new threads
  3. get rid of ReadSettings and WriteSettings fields for passing io_scheduling field

It should make the process of integrating other buffers with the scheduler more straightforward.

@serxa serxa requested a review from CheSema July 5, 2024 11:16
@serxa
Member Author

serxa commented Jul 8, 2024

@maxknv Can we help the build check to finish? How can I find out what went wrong?

Pending — 13 of 21 builds are missing. 18/31 artifact groups are OK

@maxknv
Member

maxknv commented Jul 10, 2024

@maxknv Can we help the build check to finish? How can I find out what went wrong?

Pending — 13 of 21 builds are missing. 18/31 artifact groups are OK

package_aarch64 failed with OOM; I suggest restarting it manually.

@serxa
Member Author

serxa commented Jul 12, 2024

}
response_stream = nullptr;
Session::setSendDataHooks();
Session::setReceiveDataHooks();
Member Author

This reset led to a data race: https://pastila.nl/?000777c6/1c25eebbf78952d5ed8d7c42cbcca009#or9ialwCHfjZi17MNTG5Hg==

So we cannot reset it here, because the session could still be in use for some reason. I think I should return to the logic I used a few revisions earlier: reset the hooks when the connection is reassigned to another request in sendRequest().

Member Author

It turned out to be correct. The data race was inside ResourceGuard.

@alexey-milovidov
Member

alexey-milovidov commented Aug 2, 2024

Stateless tests (release) — Some queries hung, fail: 1, passed: 6833, skipped: 6

#37686

clickhouse-cloud :) SELECT pull_request_number, commit_sha, check_name, event_time, message FROM merge('^text_log') WHERE message_format_string = 'Received EINTR while trying to drain a TimerDescriptor, fd {}: {}' AND event_date >= yesterday()

SELECT
    pull_request_number,
    commit_sha,
    check_name,
    event_time,
    message
FROM merge('^text_log')
WHERE (message_format_string = 'Received EINTR while trying to drain a TimerDescriptor, fd {}: {}') AND (event_date >= yesterday())

Query id: 045b3f12-86b9-437c-9ce2-deb6a972c129

   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
1. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 18:17:39 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
2. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 18:30:47 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
3. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 17:46:36 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 86.757 sec. Processed 19.62 billion rows, 78.44 GB (226.11 million rows/s., 904.14 MB/s.)
Peak memory usage: 233.93 MiB.

@serxa
Member Author

serxa commented Aug 3, 2024

The data race is not fixed. It just moved a few lines below, to a different location: https://pastila.nl/?000d4920/623b6fe76e20a416a7c021a625ec3d4b#BDtMc1k+7LXEcMezAHBJhA==

@serxa
Member Author

serxa commented Aug 30, 2024

Now the data race should be fixed. It was in ResourceGuard::Request. It turned out that the common optimization advice "signal your condition_variable outside the critical section" is wrong. The other thread that is supposed to wait on the cv could find your updated state and not wait at all. If it then destroys the cv, the first (signalling) thread will wake up and signal the destroyed cv. Make sure you don't fall into this trap while optimizing.

        void execute() override
        {
            {
                std::unique_lock lock(mutex);
                chassert(state == Enqueued);
                state = Dequeued;
            }
            dequeued_cv.notify_one();  // This optimization is TOTALLY WRONG. it must be under lock
        }

@serxa serxa added this pull request to the merge queue Sep 2, 2024
Merged via the queue into master with commit 1f5082e Sep 2, 2024
@serxa serxa deleted the s3-streams-scheduler branch September 2, 2024 14:58
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 2, 2024
@serxa serxa mentioned this pull request Aug 3, 2025
29 tasks

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo


Development

Successfully merging this pull request may close these issues.

Smaller granularity of resource requests for S3 buffers

7 participants