IO scheduling on HTTP session level #65182

Merged
serxa merged 28 commits into master from s3-streams-scheduler on Sep 2, 2024
Conversation

@serxa
Member

@serxa serxa commented Jun 12, 2024

Fixes #63213

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

IO scheduling for remote S3 disks is now done at the level of HTTP socket streams (instead of whole S3 requests) to resolve bandwidth_limit throttling issues.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: Normal Builds
  • Allow: Special Builds
  • Allow: All NOT Required Checks
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-improvement Pull request with some product improvements label Jun 12, 2024
@robot-ch-test-poll2
Contributor

robot-ch-test-poll2 commented Jun 12, 2024

This is an automated comment for commit c6aa12f with a description of existing statuses. It is updated for the latest CI run.


Successful checks
Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success
ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success
Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success
Docs check | Builds and tests the documentation | ✅ success
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test failed at least once, or ran too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success
Install packages | Checks that the built packages are installable in a clean environment | ✅ success
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ✅ success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | ✅ success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success
Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success
Unit tests | Runs the unit tests for different release types | ✅ success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from this PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ✅ success

@CheSema CheSema self-assigned this Jun 12, 2024
@serxa serxa marked this pull request as ready for review June 14, 2024 16:14
Member Author

It is just spaces because I ended up not changing this file. We can revert the changes here if necessary.

};

/// Scoped attach/detach of IO resource links
struct IOScope : private boost::noncopyable
Member Author

I think we will use this scope eventually for all IO operations including local, hence the name.

Member

Do we have cases with only one type of resource attached? Or are they always used together?

Member Author

For local IO it could be only read or write.

}

ResourceLink link;
ResourceGuard::Request request;
Member Author

Technically we could have used a TLS request, as ResourceGuard does, and avoided any allocations, but storing it here feels safer and more straightforward to me.

Comment on lines 37 to 40

// Adjust budget to account for extra consumption of `cost` resource units
void consumeBudget(ResourceCost cost)
{
adjustBudget(0, cost);
}

// Adjust budget to account for requested, but not consumed `cost` resource units
void accumulateBudget(ResourceCost cost)
{
adjustBudget(cost, 0);
}

/// Enqueue new request to be executed using underlying resource.
Member Author

It is not needed anymore; this logic is now integrated into ResourceGuard.

chassert(state == Dequeued);
state = Finished;
if (estimated_cost != real_cost_)
link_.queue->adjustBudget(estimated_cost, real_cost_);
Member Author

There was a bug. All ResourceGuard instances were used in the wrong way: they passed request->cost into adjustBudget(), but enqueueRequestUsingBudget() changes request->cost before enqueuing the request, so the correct initial estimated cost was not stored anywhere, even though it is required for correct adjustment of the budget. Now we store it in ResourceGuard, and its usage is much simpler and less error-prone.

@serxa serxa mentioned this pull request Jun 20, 2024

ResourceCost cost = request.GetContentLength();
ResourceGuard rlock(write_settings.resource_link, cost);
CurrentThread::IOScope io_scope(write_settings.io_scheduling);
Member

Here you are controlling only the particular current thread with this exact call. In other words, you enable scheduling only for PUT (single object + multipart) and GET (get object).
You also manually set resource guards for Azure and HDFS.

But what about the rest? I expected that you would also cover other HTTP storages here, with other types of requests.

Member

BTW, do you intend to control communication with ClickHouse disks only, or with (S3|Azure|Web)Functions as well?

Member Author

@serxa serxa Jul 5, 2024

Yes, it should cover all remote and local IO eventually. In this PR I'm just trying to improve what we already have. Let's also do the following in a separate PR:

  1. move CurrentThread::IOScope into ObjectStorage
  2. make IOScope inheritable when spawning new threads
  3. get rid of ReadSettings and WriteSettings fields for passing io_scheduling field

It should make the process of integrating other buffers with the scheduler more straightforward.

@serxa serxa requested a review from CheSema July 5, 2024 11:16
@serxa
Member Author

serxa commented Jul 8, 2024

@maxknv Can we help the build check to finish? How can I find out what went wrong?

Pending — 13 of 21 builds are missing. 18/31 artifact groups are OK

@maxknv
Member

maxknv commented Jul 10, 2024

@maxknv Can we help the build check to finish? How can I find out what went wrong?

Pending — 13 of 21 builds are missing. 18/31 artifact groups are OK

package_aarch64 failed with OOM; I suggest restarting it manually.

@serxa
Member Author

serxa commented Jul 12, 2024

}
response_stream = nullptr;
Session::setSendDataHooks();
Session::setReceiveDataHooks();
Member Author

This reset led to a data race: https://pastila.nl/?000777c6/1c25eebbf78952d5ed8d7c42cbcca009#or9ialwCHfjZi17MNTG5Hg==

So we cannot reset it here, because the session could still be in use for some reason. I think I should return to the logic I used a few revisions earlier: reset the hooks when the connection is reassigned to another request in sendRequest().

Member Author

It turned out to be correct. The data race was inside ResourceGuard.

@alexey-milovidov
Member

alexey-milovidov commented Aug 2, 2024

Stateless tests (release) — Some queries hung, fail: 1, passed: 6833, skipped: 6

#37686

clickhouse-cloud :) SELECT pull_request_number, commit_sha, check_name, event_time, message FROM merge('^text_log') WHERE message_format_string = 'Received EINTR while trying to drain a TimerDescriptor, fd {}: {}' AND event_date >= yesterday()

SELECT
    pull_request_number,
    commit_sha,
    check_name,
    event_time,
    message
FROM merge('^text_log')
WHERE (message_format_string = 'Received EINTR while trying to drain a TimerDescriptor, fd {}: {}') AND (event_date >= yesterday())

Query id: 045b3f12-86b9-437c-9ce2-deb6a972c129

   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
1. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 18:17:39 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
2. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 18:30:47 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘
   ┌─pull_request_number─┬─commit_sha───────────────────────────────┬─check_name────────────────┬──────────event_time─┬─message─────────────────────────────────────────────────────────────────────────────┐
3. │               65182 │ 3bff7ddcf8891d091bc5be2b827172029fb8b76f │ Stateless tests (release) │ 2024-08-02 17:46:36 │ Received EINTR while trying to drain a TimerDescriptor, fd 84: anon_inode:[timerfd] │
   └─────────────────────┴──────────────────────────────────────────┴───────────────────────────┴─────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 86.757 sec. Processed 19.62 billion rows, 78.44 GB (226.11 million rows/s., 904.14 MB/s.)
Peak memory usage: 233.93 MiB.

@serxa
Member Author

serxa commented Aug 3, 2024

The data race is not fixed. It just moved a few lines below, to a different location: https://pastila.nl/?000d4920/623b6fe76e20a416a7c021a625ec3d4b#BDtMc1k+7LXEcMezAHBJhA==

@serxa
Member Author

serxa commented Aug 30, 2024

Now the data race should be fixed. It was in ResourceGuard::Request. It turned out that the common optimization advice "signal your condition_variable outside the critical section" is wrong. The other thread that is supposed to wait on the cv could find your updated state and not wait at all. If it then destroys the cv, the first (signalling) thread will wake up and signal the destroyed cv. Make sure you don't fall into this trap while optimizing.

        void execute() override
        {
            {
                std::unique_lock lock(mutex);
                chassert(state == Enqueued);
                state = Dequeued;
            }
            dequeued_cv.notify_one();  // This optimization is TOTALLY WRONG. it must be under lock
        }

@serxa serxa added this pull request to the merge queue Sep 2, 2024
Merged via the queue into master with commit 1f5082e Sep 2, 2024
@serxa serxa deleted the s3-streams-scheduler branch September 2, 2024 14:58
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 2, 2024
@serxa serxa mentioned this pull request Aug 3, 2025
29 tasks

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo


Development

Successfully merging this pull request may close these issues.

Smaller granularity of resource requests for S3 buffers

7 participants