Skip to content

Socket level throttling for S3 read and write requests#81837

Merged
serxa merged 30 commits intomasterfrom
socket-level-throttling
Jul 28, 2025
Merged

Socket level throttling for S3 read and write requests#81837
serxa merged 30 commits intomasterfrom
socket-level-throttling

Conversation

@serxa
Copy link
Copy Markdown
Member

@serxa serxa commented Jun 13, 2025

Depends on #81753 and should be reviewed after it is merged

Follows similar logic as #65182 and resolves the same problem: Coarse-grained throttling based on S3 request (or response) size does not work well when request sizes are larger than the token bucket size (usually 1 sec * max_speed B/sec). Throttling such large chunks leads to underutilization of real network bandwidth and idle-waiting of threads on throttlers for multiple seconds. Throttling individual socket operations instead of whole S3 requests solves the problem by distributing traffic consumption over a longer duration. This allows effective replenishment and use of the token bucket volume.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

S3 read and write requests are throttled on the HTTP socket level (instead of whole S3 requests) to avoid issues with max_remote_read_network_bandwidth_for_server and max_remote_write_network_bandwidth_for_server throttling.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jun 13, 2025

Workflow [PR], commit [29d7544]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Jun 13, 2025
@serxa serxa requested a review from Copilot June 13, 2025 21:26

This comment was marked as outdated.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jun 25, 2025

Workflow [PR], commit [e01b689]

Summary:

job_name test_name status info comment
Stateless tests (amd_asan, distributed plan, sequential) failure
00002_log_and_exception_messages_formatting FAIL
Stateless tests (amd_ubsan, parallel) failure
01825_new_type_json_ghdata_insert_select FAIL

@serxa serxa force-pushed the socket-level-throttling branch from 05d76a5 to 29d7544 Compare June 25, 2025 11:03
@CheSema
Copy link
Copy Markdown
Member

CheSema commented Jun 30, 2025

We spoke about that.
The outcome is:

  • try to sleep before consumption
  • make sure that consumption is more granular for reading, we expect read no more that several kilobytes in one call
  • think how is it possible to not sleep too long (longer that read/write timeout) and not sleep after query deadline

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 1, 2025

Addressing review:

  • try to sleep before consumption
  1. I've moved throttling to places before syscalls.
  2. Added budget because we don't know how much resource will be needed before syscall
  3. Added minimum amount to request through throttler (32KB) to avoid too frequent calls, because the current implementation has a mutex. Rough benchmarks show that 32KB should be enough for 40Gbps network throttler to never be a bottleneck itself (details)
  • make sure that consumption is more granular for reading, we expect read no more that several kilobytes in one call
  • think how is it possible to not sleep too long (longer that read/write timeout) and not sleep after query deadline

Not addressed yet

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 4, 2025

think how is it possible to not sleep too long (longer that read/write timeout) and not sleep after query deadline

  1. I've added support for limits inIThrottler::throttle() and passed half of socket recv/send timeouts there. I think this should be good to avoid any timeout due to throttling.
  2. Regarding query deadlines I think it should be handled in a separate PR by adding cancellation points into all blocking calls, including sleepForNanoseconds() which is used by throttlers. Now it is prepared for signals but just reenters sleep, so this is a good point for cancellation exception to be thrown. To support it I made DB::Throttler code exception-safe in where it calls sleep.

Copy link
Copy Markdown
Member

@CheSema CheSema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
But tests are broken

@serxa serxa requested a review from Copilot July 18, 2025 15:00

This comment was marked as outdated.

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 18, 2025

I have found the problem with tests. It's an unsigned integer overflow that happens if max_block_us == size_t(-1)

❯ utils/c++expr -I -i utility -i base/types.h 'static constexpr auto NS = 1000000000UL; double tokens_value = -48576.0; size_t max_block_us = 18446744073709551615uz; size_t max_speed_value = 1000000; Int64 block_ns = static_cast<Int64>(-tokens_value / max_speed_value * NS); OUT(std::min<Int64>(max_block_us * 1000, block_ns)) OUT(block_ns)'
std::min<Int64>(max_block_us * 1000, block_ns) -> -1000 <====== negative value passed forward to sleep function taking UInt64
block_ns -> 48576000

UPD. Interestingly enough, Copilot was able to find it as well. I'd better ask it before I figure it out by myself...

@serxa serxa requested a review from Copilot July 18, 2025 15:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements socket-level throttling for S3 read and write requests to address issues with coarse-grained throttling when request sizes exceed token bucket capacity. The change moves throttling from the S3 request level to individual socket operations, enabling better distribution of traffic consumption and more effective use of token bucket volumes.

  • Refactors throttler interface from add() to throttle() method with improved blocking semantics
  • Implements socket-level throttling in Poco networking stack with budget-based approach
  • Integrates throttlers into HTTP connection pool and S3 operations via thread-local scopes

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/Common/Throttler.h Updates throttler interface to use throttle() method instead of add()
src/Common/Throttler.cpp Implements new throttling logic with blocking semantics and thread safety
src/IO/WriteBufferFromS3.cpp Replaces request-level throttling with socket-level throttling scopes
src/IO/ReadBufferFromS3.cpp Removes manual throttling calls, delegates to socket-level implementation
base/poco/Net/src/SocketImpl.cpp Adds socket-level throttling with budget management
src/Common/CurrentThread.h Adds throttling scope classes for read/write operations
src/Common/HTTPConnectionPool.cpp Integrates throttlers into HTTP session management
Multiple buffer files Updates throttler method calls from add() to throttle()

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 23, 2025

With the help of integration tests debugging I was able to find and fix the rest of the issues with tests:

  1. The connection pool uses HTTPSessionClient::assign() to hijack the socket. Now, the throttlers are properly set afterwards.
  2. The SecureSocketImpl does not use direct recv or send calls but instead uses SSL_read and SSL_write. Now these calls are wrapped with the throttler is exactly the same way as for SocketImpl.

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 25, 2025

PR / Integration tests (release, 3/4) (pull_request)Failing after 122m

This is a bit strange. I checked all the *.log files, and they are all successful and fast, but there is an error in the job logs. @tavplubix PTAL

INFO:root:All tests from group parallel7 passed
INFO:root:Tests from group parallel7 stats, ERROR count 0
INFO:root:Totally have 0 with status ERROR
INFO:root:Tests from group parallel7 stats, PASSED count 71
INFO:root:Totally have 940 with status PASSED
INFO:root:Tests from group parallel7 stats, FAILED count 0
INFO:root:Totally have 0 with status FAILED
INFO:root:Tests from group parallel7 stats, SKIPPED count 0
INFO:root:Totally have 0 with status SKIPPED
INFO:root:Tests from group parallel7 stats, BROKEN count 0
INFO:root:Totally have 0 with status BROKEN
INFO:root:Tests from group parallel7 stats, NOT_FAILED count 0
INFO:root:Totally have 0 with status NOT_FAILED
INFO:root:Totally finished tests 940/940
INFO:root:Overall success!
ERROR:root:Job killed by external timeout signal - setting status to failure!
INFO:root:Tests finished
INFO:root:Dumping dmesg
[Fri Jul 25 00:18:47 2025] br-38e14867abd6: port 1(veth4e56134) entered forwarding state
[Fri Jul 25 00:18:58 2025] br-38e14867abd6: port 1(veth4e56134) entered disabled state
[Fri Jul 25 00:18:58 2025] veth833236c: renamed from eth0
[Fri Jul 25 00:18:58 2025] br-38e14867abd6: port 1(veth4e56134) entered disabled state
[Fri Jul 25 00:18:58 2025] veth4e56134 (unregistering): left allmulticast mode
[Fri Jul 25 00:18:58 2025] veth4e56134 (unregistering): left promiscuous mode
[Fri Jul 25 00:18:58 2025] br-38e14867abd6: port 1(veth4e56134) entered disabled state
[Fri Jul 25 00:22:31 2025] br-d52f149df01e: port 5(veth0a42026) entered disabled state
[Fri Jul 25 00:22:31 2025] veth5573740: renamed from eth0
[Fri Jul 25 00:22:31 2025] br-d52f149df01e: port 4(veth8afaa0f) entered disabled state
[Fri Jul 25 00:22:31 2025] veth8e918f4: renamed from eth0
[Fri Jul 25 00:22:31 2025] br-d52f149df01e: port 5(veth0a42026) entered disabled state
[Fri Jul 25 00:22:31 2025] veth0a42026 (unregistering): left allmulticast mode
[Fri Jul 25 00:22:31 2025] veth0a42026 (unregistering): left promiscuous mode

@serxa
Copy link
Copy Markdown
Member Author

serxa commented Jul 28, 2025

All failed tests are unrelated

PR / Stateless tests (amd_asan, distributed plan, sequential) (pull_request)Failing after 74m

00002_log_and_exception_messages_formatting #84538
01825_new_type_json_ghdata_insert_select #84577

@serxa serxa added this pull request to the merge queue Jul 28, 2025
Merged via the queue into master with commit e8935e4 Jul 28, 2025
123 of 125 checks passed
@serxa serxa deleted the socket-level-throttling branch July 28, 2025 23:21
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 28, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to apache/gluten that referenced this pull request Jul 29, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250729)

* Fix build due to ClickHouse/ClickHouse#76802

* Fix build due to ClickHouse/ClickHouse#81837

* Fix build due to ClickHouse/ClickHouse#84011

* Fix gtest due to ClickHouse/ClickHouse#83599

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang chen <chenchang@apache.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants