Reject queries when the server is overloaded #63206

Merged
alexkats merged 1 commit into ClickHouse:master from alexkats:drop-connections
Apr 7, 2025

Conversation

@alexkats
Contributor

@alexkats alexkats commented Apr 30, 2024

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Reject queries when the server is overloaded. The decision is based on the ratio of wait time (OSCPUWaitMicroseconds) to busy time (OSCPUVirtualTimeMicroseconds). When this ratio is between min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw (both are query-level settings), the query is rejected with some probability.
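
A minimal sketch of the decision described above, assuming the linear interpolation between the two settings that their descriptions in system.settings mention. The function names, the injectable `rng` parameter, and the treatment of an empty range (min >= max, for example both settings at 0) as "check disabled" are illustrative assumptions, not the server's actual code.

```python
import random


def rejection_probability(ratio: float, min_ratio: float, max_ratio: float) -> float:
    """Probability of rejecting a query for a given wait/busy time ratio."""
    # Assumption: an empty range (for example both settings set to 0) means
    # the check is skipped, which also avoids a division by zero below.
    if max_ratio <= 0 or min_ratio >= max_ratio:
        return 0.0
    if ratio <= min_ratio:
        return 0.0  # below the lower threshold: never reject
    if ratio >= max_ratio:
        return 1.0  # at or above the upper threshold: always reject
    # Linear interpolation: 0 at min_ratio, 1 at max_ratio.
    return (ratio - min_ratio) / (max_ratio - min_ratio)


def should_reject(ratio: float, min_ratio: float, max_ratio: float, rng=random.random) -> bool:
    """Draw a uniform number in [0, 1) and reject if it falls under the probability."""
    return rng() < rejection_probability(ratio, min_ratio, max_ratio)


if __name__ == "__main__":
    # Values taken from the error message quoted later in this thread:
    # ratio 3.2004..., min 2, max 6 gives a probability of about 0.3001.
    print(rejection_probability(3.2004100897425842, 2.0, 6.0))
```

With min 2 and max 6, a ratio of 4 would give a probability of 0.5 under this sketch, so roughly half of the incoming queries would be rejected at that load.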

@alexkats alexkats requested a review from tavplubix April 30, 2024 19:41
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-performance Pull request with some performance improvements label Apr 30, 2024
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented Apr 30, 2024

This is an automated comment for commit 5577931 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
|---|---|---|
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ⏳ pending |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ❌ failure |

Successful checks

| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts | ✅ success |

@tavplubix tavplubix self-assigned this May 1, 2024
@alexkats alexkats added the can be tested Allows running workflows for external contributors label May 1, 2024
@alexkats alexkats force-pushed the drop-connections branch 2 times, most recently from b11d72e to 86093b2 Compare May 2, 2024 16:00
@tavplubix
Member

Stateful tests (tsan) — Invalid check_status.tsv
Details
stderr.log:

WARNING: ThreadSanitizer: data race (pid=364)
  Read of size 8 at 0x55f44dd91ad0 by thread T2:
    #0 ProfileEvents::Counters::getLastValue(StrongTypedef<unsigned long, ProfileEvents::EventTag>) const build_docker/./src/Common/ProfileEvents.h:92:20 (clickhouse+0x17e4b3aa) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #1 DB::checkCPULoad(std::__1::shared_ptr<DB::Context const>) build_docker/./src/Interpreters/executeQuery.cpp:299:58 (clickhouse+0x17e4b3aa)
    #2 DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) build_docker/./src/Interpreters/executeQuery.cpp:1419:5 (clickhouse+0x17e4aee8) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #3 DB::TCPHandler::runImpl() build_docker/./src/Server/TCPHandler.cpp:522:54 (clickhouse+0x196d3538) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #4 DB::TCPHandler::run() build_docker/./src/Server/TCPHandler.cpp:2341:9 (clickhouse+0x196f1867) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #5 Poco::Net::TCPServerConnection::start() build_docker/./base/poco/Net/src/TCPServerConnection.cpp:43:3 (clickhouse+0x1d277922) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #6 Poco::Net::TCPServerDispatcher::run() build_docker/./base/poco/Net/src/TCPServerDispatcher.cpp:115:20 (clickhouse+0x1d278191) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #7 Poco::PooledThread::run() build_docker/./base/poco/Foundation/src/ThreadPool.cpp:188:14 (clickhouse+0x1d47a7c6) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #8 Poco::(anonymous namespace)::RunnableHolder::run() build_docker/./base/poco/Foundation/src/Thread.cpp:45:11 (clickhouse+0x1d478a8f) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #9 Poco::ThreadImpl::runnableEntry(void*) build_docker/./base/poco/Foundation/src/Thread_POSIX.cpp:335:27 (clickhouse+0x1d476f49) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)

  Previous write of size 8 at 0x55f44dd91ad0 by thread T685:
    #0 std::__1::enable_if<is_move_constructible<unsigned long>::value && is_move_assignable<unsigned long>::value, void>::type std::__1::swap[abi:v15000]<unsigned long>(unsigned long&, unsigned long&) build_docker/./contrib/llvm-project/libcxx/include/__utility/swap.h:36:7 (clickhouse+0x17ca9456) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #1 ProfileEvents::Counters::getAndUpdateLastValue(StrongTypedef<unsigned long, ProfileEvents::EventTag>, unsigned long) const build_docker/./src/Common/ProfileEvents.h:86:13 (clickhouse+0x17ca9456)
    #2 DB::MetricLog::metricThreadFunction() build_docker/./src/Interpreters/MetricLog.cpp:106:87 (clickhouse+0x17ca9456)
    #3 DB::MetricLog::startCollectMetric(unsigned long)::$_0::operator()() const build_docker/./src/Interpreters/MetricLog.cpp:64:75 (clickhouse+0x17ca9ace) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #4 decltype(std::declval<DB::MetricLog::startCollectMetric(unsigned long)::$_0&>()()) std::__1::__invoke[abi:v15000]<DB::MetricLog::startCollectMetric(unsigned long)::$_0&>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23 (clickhouse+0x17ca9ace)
    #5 decltype(auto) std::__1::__apply_tuple_impl[abi:v15000]<DB::MetricLog::startCollectMetric(unsigned long)::$_0&, std::__1::tuple<>&>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&, std::__1::tuple<>&, std::__1::__tuple_indices<>) build_docker/./contrib/llvm-project/libcxx/include/tuple:1789:1 (clickhouse+0x17ca9ace)
    #6 decltype(auto) std::__1::apply[abi:v15000]<DB::MetricLog::startCollectMetric(unsigned long)::$_0&, std::__1::tuple<>&>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&, std::__1::tuple<>&) build_docker/./contrib/llvm-project/libcxx/include/tuple:1798:1 (clickhouse+0x17ca9ace)
    #7 ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'()::operator()() build_docker/./src/Common/ThreadPool.h:246:13 (clickhouse+0x17ca9ace)
    #8 decltype(std::declval<DB::MetricLog::startCollectMetric(unsigned long)::$_0>()()) std::__1::__invoke[abi:v15000]<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'()&>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23 (clickhouse+0x17ca9ace)
    #9 void std::__1::__invoke_void_return_wrapper<void, true>::__call<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'()&>(ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'()&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:479:9 (clickhouse+0x17ca9ace)
    #10 std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'(), void ()>::operator()[abi:v15000]() build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:235:12 (clickhouse+0x17ca9ace)
    #11 void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<DB::MetricLog::startCollectMetric(unsigned long)::$_0>(DB::MetricLog::startCollectMetric(unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*) build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:716:16 (clickhouse+0x17ca9ace)
    #12 std::__1::__function::__policy_func<void ()>::operator()[abi:v15000]() const build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:848:16 (clickhouse+0xf0792ee) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #13 std::__1::function<void ()>::operator()() const build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:1187:12 (clickhouse+0xf0792ee)
    #14 ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) build_docker/./src/Common/ThreadPool.cpp:458:13 (clickhouse+0xf0792ee)
    #15 void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()::operator()() const build_docker/./src/Common/ThreadPool.cpp:220:73 (clickhouse+0xf07fad1) (BuildId: 9bd545107f3d5438397d7db095daa5a3517a3edc)
    #16 decltype(std::declval<void>()()) std::__1::__invoke[abi:v15000]<void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23 (clickhouse+0xf07fad1)
    #17 void std::__1::__thread_execute[abi:v15000]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>(std::__1::tuple<void, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>&, std::__1::__tuple_indices<>) build_docker/./contrib/llvm-project/libcxx/include/thread:284:5 (clickhouse+0xf07fad1)
    #18 void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, Priority, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*) build_docker/./contrib/llvm-project/libcxx/include/thread:295:5 (clickhouse+0xf07fad1)

@alexkats alexkats force-pushed the drop-connections branch 3 times, most recently from 0b94adb to 418dbf6 Compare May 14, 2024 08:49
@alexkats alexkats force-pushed the drop-connections branch 4 times, most recently from 0431684 to 118dcfe Compare May 16, 2024 11:13
@alexkats alexkats force-pushed the drop-connections branch from 118dcfe to 9632f2e Compare May 26, 2024 16:36
@alexkats alexkats force-pushed the drop-connections branch 4 times, most recently from 433a357 to a4702fe Compare June 21, 2024 14:05
@alexkats alexkats changed the title [WIP] Drop connections when the server is overloaded Drop connections when the server is overloaded Jun 21, 2024
@alexkats alexkats force-pushed the drop-connections branch 2 times, most recently from 30d0ca4 to f5c42e0 Compare June 21, 2024 18:46
@alexkats alexkats force-pushed the drop-connections branch 3 times, most recently from c4bb031 to 83ed02e Compare June 27, 2024 15:31
@alexkats
Contributor Author

alexkats commented Apr 1, 2025

> @alexkats,
>
> 1. Move the settings to the query level if possible. It means, thresholds can be adapted dynamically on a query level.
> 2. For tests, make it as a smoke test - that will either quickly throw when the thresholds are very low, or do it with a failpoint that artificially inflates the metric value.

  1. Agreed, done.
  2. I created a fairly simple integration test for this scenario.

@alexkats
Contributor Author

alexkats commented Apr 1, 2025

I'll wait for the CI to confirm that there are no unexpected SERVER_OVERLOADED errors in tests, but in general this PR is ready.

@alexkats alexkats marked this pull request as ready for review April 1, 2025 17:35
@alexkats alexkats force-pushed the drop-connections branch 2 times, most recently from 8a82318 to 9e98d9e Compare April 2, 2025 14:00
@alexkats alexkats changed the title Drop connections when the server is overloaded Reject queries when the server is overloaded Apr 5, 2025
@alexkats
Contributor Author

alexkats commented Apr 5, 2025

AST fuzzer (tsan) - #76496
Stateless tests (release, old analyzer, s3, DatabaseReplicated, 2/2) - #70319

@alexkats alexkats added this pull request to the merge queue Apr 7, 2025
Merged via the queue into ClickHouse:master with commit 985048c Apr 7, 2025
115 of 119 checks passed
@alexkats alexkats deleted the drop-connections branch April 7, 2025 13:39
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 7, 2025
@tluchowski

How can I disable this functionality? I started getting SERVER_OVERLOADED errors because of this. I understand the idea but currently it causes me problems so I don't want to have it on by default.

Do I need to set min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw to some values so that the condition is never met, or is there a cleaner approach?

I'd appreciate a config snippet I can include to disable this completely.

@alexkats
Contributor Author

alexkats commented May 5, 2025

> How can I disable this functionality? I started getting SERVER_OVERLOADED errors because of this. I understand the idea but currently it causes me problems so I don't want to have it on by default.
>
> Do I need to set min_os_cpu_wait_time_ratio_to_throw and max_os_cpu_wait_time_ratio_to_throw to some values so that the condition is never met, or is there a cleaner approach?
>
> I'd appreciate a config snippet I can include to disable this completely.

You can just set both values to 0. I also opened #79052 to increase the defaults.

@tluchowski

Thank you! I understand that I won't be bitten by trying to divide 0 by 0? :-) can you share a config snippet for this? I want to be absolutely sure I get it right the first time :)

@alexkats
Contributor Author

alexkats commented May 5, 2025

> Thank you! I understand that I won't be bitten by trying to divide 0 by 0? :-) can you share a config snippet for this? I want to be absolutely sure I get it right the first time :)

It'll be OK because of an earlier check between the min and max ratio. Regarding the snippet, it's a pretty straightforward user settings change (https://clickhouse.com/docs/operations/configuration-files#user-settings). It can look something like this:

<clickhouse>
  <profiles>
    <default>
      <min_os_cpu_wait_time_ratio_to_throw>0</min_os_cpu_wait_time_ratio_to_throw>
      <max_os_cpu_wait_time_ratio_to_throw>0</max_os_cpu_wait_time_ratio_to_throw>
    </default>
  </profiles>
</clickhouse>

@tluchowski

Awesome, thank you!

nikitamikhaylov added a commit that referenced this pull request May 6, 2025
This reverts commit 985048c, reversing
changes made to 5a030f8.

@tluchowski

tluchowski commented May 6, 2025

> > Thank you! I understand that I won't be bitten by trying to divide 0 by 0? :-) can you share a config snippet for this? I want to be absolutely sure I get it right the first time :)
>
> It'll be ok due to earlier check between min and max ratio. Regarding the snippet, it's a pretty straightforward user settings change (https://clickhouse.com/docs/operations/configuration-files#user-settings). It can look smth like this:
>
> <clickhouse>
>   <profiles>
>     <default>
>       <min_os_cpu_wait_time_ratio_to_throw>0</min_os_cpu_wait_time_ratio_to_throw>
>       <max_os_cpu_wait_time_ratio_to_throw>0</max_os_cpu_wait_time_ratio_to_throw>
>     </default>
>   </profiles>
> </clickhouse>

Are you 100% sure this config snippet works? I copied it, pasted to /etc/clickhouse-server/config.d/server_overload_disable.xml, restarted the system but I still got:

2025.05.06 05:54:31.380981 [ 65144 ] {4d3e3416-c9e6-4249-866a-317334462d3b} DynamicQueryHandler: Code: 745. DB::Exception: CPU is overloaded, CPU is waiting for execution way more than executing, ratio of wait time (OSCPUWaitMicroseconds metric) to busy time (OSCPUVirtualTimeMicroseconds metric) is 3.2004100897425842. Min ratio for error (min_os_cpu_wait_time_ratio_to_throw setting) 2, max ratio for error (max_os_cpu_wait_time_ratio_to_throw setting) 6, probability used to decide whether to discard the query 0.30010252243564606. Consider reducing the number of queries or increase backoff between retries. (SERVER_OVERLOADED), Stack trace (when copying this message, always include the lines below):

  1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000f2bff3b
  2. DB::Exception::Exception(PreformattedMessage const&, int) @ 0x000000000f34540c
  3. ProfileEvents::checkCPUOverload(long, double, double, bool) @ 0x000000000f345075
  4. DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, std::shared_ptr<DB::Context>, std::function<void (DB::QueryResultDetails const&)>, DB::QueryFlags, std::optional<DB::FormatSettings> const&, std::function<void (DB::IOutputFormat&, String const&, std::shared_ptr<DB::Context const> const&, std::optional<DB::FormatSettings> const&)>) @ 0x000000001359b13b
  5. DB::HTTPHandler::processQuery(DB::HTTPServerRequest&, DB::HTMLForm&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&, std::optional<DB::CurrentThread::QueryScope>&, StrongTypedef<unsigned long, ProfileEvents::EventTag> const&) @ 0x0000000014861401
  6. DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&, StrongTypedef<unsigned long, ProfileEvents::EventTag> const&) @ 0x0000000014865969
  7. DB::HTTPServerConnection::run() @ 0x000000001491bb2a
  8. Poco::Net::TCPServerConnection::start() @ 0x00000000180162e7
  9. Poco::Net::TCPServerDispatcher::run() @ 0x0000000018016739
  10. Poco::PooledThread::run() @ 0x0000000017fe1a5b
  11. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000017fdff3d
  12. ? @ 0x00007f94470897fa
  13. ? @ 0x00007f944710e820
    (version 25.4.2.31 (official build))

When I check in clickhouse-client:

SELECT * FROM system.settings WHERE name LIKE '%cpu_wait%'

Row 1:
──────
name: min_os_cpu_wait_time_ratio_to_throw
value: 2
changed: 0
description: Min ratio between OS CPU wait (OSCPUWaitMicroseconds metric) and busy (OSCPUVirtualTimeMicroseconds metric) times to consider rejecting queries. Linear interpolation between min and max ratio is used to calculate the probability, the probability is 0 at this point.
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: Float
default: 2
alias_for:
is_obsolete: 0
tier: Production

Row 2:
──────
name: max_os_cpu_wait_time_ratio_to_throw
value: 6
changed: 0
description: Max ratio between OS CPU wait (OSCPUWaitMicroseconds metric) and busy (OSCPUVirtualTimeMicroseconds metric) times to consider rejecting queries. Linear interpolation between min and max ratio is used to calculate the probability, the probability is 1 at this point.
min: ᴺᵁᴸᴸ
max: ᴺᵁᴸᴸ
readonly: 0
type: Float
default: 6
alias_for:
is_obsolete: 0
tier: Production
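
The error message quoted above suggests reducing the number of queries or increasing the backoff between retries. A client-side sketch of that second option follows. The error code 745 comes from the log; `QueryError`, `run_with_backoff`, and the `run_query` callable are hypothetical names standing in for whatever client library is in use, and the delay values are arbitrary.

```python
import random
import time

SERVER_OVERLOADED = 745  # error code shown in the log above


class QueryError(Exception):
    """Hypothetical client-side exception carrying a ClickHouse error code."""

    def __init__(self, code: int, message: str = ""):
        super().__init__(message)
        self.code = code


def run_with_backoff(run_query, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call run_query(), retrying only on SERVER_OVERLOADED with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except QueryError as exc:
            # Re-raise anything that is not an overload rejection, and give up
            # after the last allowed attempt.
            if exc.code != SERVER_OVERLOADED or attempt == max_attempts - 1:
                raise
            # Exponential growth capped at max_delay, with full jitter so that
            # many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the retry schedule can be exercised without real waiting.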

@tluchowski

Ah I should've put it in /etc/clickhouse-server/users.d, not config.d

Now it works fine, apologies for the noise!
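
To summarize the resolution: the override is a user-level (profile) setting, so the file belongs under users.d rather than config.d. The sketch below generates the same XML as the snippet in this thread and computes the target path. The helper names are illustrative, and the file name is simply the one used earlier in the thread; nothing here is an official tool.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_override(min_ratio=0, max_ratio=0, profile="default") -> str:
    """Return profile XML that sets both overload thresholds (0 and 0 to disable)."""
    root = ET.Element("clickhouse")
    profiles = ET.SubElement(root, "profiles")
    prof = ET.SubElement(profiles, profile)
    ET.SubElement(prof, "min_os_cpu_wait_time_ratio_to_throw").text = str(min_ratio)
    ET.SubElement(prof, "max_os_cpu_wait_time_ratio_to_throw").text = str(max_ratio)
    return ET.tostring(root, encoding="unicode")


def override_path(config_root="/etc/clickhouse-server", name="server_overload_disable.xml") -> Path:
    """Profile settings go in users.d, not config.d."""
    return Path(config_root) / "users.d" / name


if __name__ == "__main__":
    print(override_path())
    print(build_override())
```

After applying the file, the `value` column of the system.settings query shown earlier should read 0 for both settings; in the failing config.d attempt it stayed at the defaults 2 and 6 with `changed: 0`.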

Labels

  • can be tested: Allows running workflows for external contributors
  • pr-improvement: Pull request with some product improvements
  • pr-synced-to-cloud: The PR is synced to the cloud repo


7 participants