Harden test_ddl_worker_with_loopback_hosts against CI flakes #88342
Conversation
Waiting for ZooKeeper readiness, increasing timeouts & handling transient failures
```python
        return
    time.sleep(POLL_INTERVAL)
```

```python
# Last-chance idempotent nudge to converge
```
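For context, the readiness-polling pattern the excerpt above comes from can be sketched as a small generic helper. This is a minimal sketch, not the test's actual code; `POLL_INTERVAL` and the `probe` callable are assumed names:

```python
import time

POLL_INTERVAL = 0.5  # seconds between probes (assumed value)


def wait_until(probe, timeout=60.0, poll_interval=POLL_INTERVAL):
    """Poll `probe()` until it returns truthy or `timeout` elapses.

    Returns True on success, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(poll_interval)
    return False
```

Using a monotonic clock avoids deadline miscounting if the CI host's wall clock is adjusted mid-test.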
Question: Why add this final try? Is the above not good enough? Nevermind, I misunderstood the test
```python
"Code: 159",
"Code: 32",
```
Better to use human names instead of codes
However, the codes won't change, making it more reliable for the test.
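One way to get both properties being argued for here — stable numeric codes for matching, human-readable names for readers — is to keep the names as comments next to the codes. A hypothetical sketch (the tuple and function names are illustrative, not the test's actual code):

```python
# Numeric codes are stable across releases; the symbolic names are
# kept alongside as comments so readers don't have to look them up.
TRANSIENT_ERRORS = (
    "Code: 159",  # TIMEOUT_EXCEEDED
    "Code: 32",   # ATTEMPT_TO_READ_AFTER_EOF
)


def is_transient(error_text):
    """Return True if the error message matches a known transient failure."""
    return any(marker in error_text for marker in TRANSIENT_ERRORS)
```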
```python
# Last-chance idempotent nudge to converge
if create_sql.startswith("CREATE TABLE "):
    safe_sql = create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1)
```
Maybe we should just modify the original queries?
If the query is modified, we must also update check_fn accordingly.
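If the original query were rewritten, the check would have to change in step, as noted. A hypothetical sketch of keeping the two coupled (`make_idempotent`, `make_check_fn`, and the literal-SQL comparison are illustrative assumptions, not the test's actual `check_fn`):

```python
def make_idempotent(create_sql):
    """Rewrite a plain CREATE TABLE into its idempotent form; pass others through."""
    if create_sql.startswith("CREATE TABLE "):
        return create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1)
    return create_sql


def make_check_fn(create_sql):
    """Build a convergence check against the SQL that will actually be issued.

    Deriving the expected text from the same rewrite keeps the query and the
    check from drifting apart.
    """
    expected_sql = make_idempotent(create_sql)

    def check_fn(observed_sql):
        return observed_sql == expected_sql

    return check_fn
```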
Default value is 180.
But why does this happen?
For this test, we used to explicitly set a 10-second deadline timeout.
Code: 32 was caused by the client losing connection while ZooKeeper and the DDL worker were still stabilizing after startup. We now wait for ZooKeeper readiness before running the first ON CLUSTER query and poll for convergence, so this transient EOF won't happen again.
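A hedged sketch of the retry-on-transient-error idea described here; `run_query` stands in for the test node's query method and the parameter values are assumptions:

```python
import time


def query_with_retries(run_query, sql, attempts=5, delay=0.1):
    """Run `sql` via `run_query`, retrying only on known transient errors."""
    transient = ("Code: 159", "Code: 32")  # TIMEOUT_EXCEEDED, ATTEMPT_TO_READ_AFTER_EOF
    for attempt in range(attempts):
        try:
            return run_query(sql)
        except Exception as exc:
            last_try = attempt == attempts - 1
            # Re-raise immediately on the final attempt or on any
            # non-transient error, so real bugs are not masked.
            if last_try or not any(marker in str(exc) for marker in transient):
                raise
            time.sleep(delay)
```

Retrying only on an explicit allow-list of transient codes is what keeps this from hiding genuine failures, which is the concern raised below.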
```python
# Last-chance idempotent nudge to converge
if create_sql.startswith("CREATE TABLE "):
    safe_sql = create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1)
node_issuer.query(
```
@zlareb1 how does retrying a modified query help? By the same logic you could just run the original query in a loop. This change breaks the test and hides a real bug.
```
2025.10.13 12:18:36.717663 [ 681 ] {} <Error> DDLWorker: ZooKeeper error: Code: 999. Coordination::Exception: Transaction failed (Node exists): Op #0, path: /clickhouse/task_queue/replicas/127%2E0%2E0%2E1:9000/active. (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
0. ./contrib/llvm-project/libcxx/include/__exception/exception.h:113: Poco::Exception::Exception(String const&, int) @ 0x0000000029f56fa0
1. ./ci/tmp/build/./src/Common/Exception.cpp:129: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000013799574
2. DB::Exception::Exception(String&&, int, String, bool) @ 0x00000000093559ee
3. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x00000000093553f0
4. ./src/Common/Exception.h:141: DB::Exception::Exception<Coordination::Error&, unsigned long&>(int, FormatStringHelperImpl<std::type_identity<Coordination::Error&>::type, std::type_identity<unsigned long&>::type>, Coordination::Error&, unsigned long&) @ 0x0000000023642906
5. ./src/Common/ZooKeeper/KeeperException.h:38: zkutil::KeeperMultiException::KeeperMultiException(Coordination::Error, unsigned long, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x0000000023628861
6. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:1610: zkutil::KeeperMultiException::KeeperMultiException(Coordination::Error, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x0000000023628d7f
7. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:1627: zkutil::KeeperMultiException::check(Coordination::Error, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x000000002361a59d
8. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:748: zkutil::ZooKeeper::multi(std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, bool) @ 0x000000002361a435
9. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1394: DB::DDLWorker::markReplicasActive(bool) @ 0x000000001c5bb8b4
10. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1145: DB::DDLWorker::initializeMainThread() @ 0x000000001c5b6457
11. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1212: DB::DDLWorker::runMainThread() @ 0x000000001c5985fa
12. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x000000001c5bec05
13. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:149: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000001c5beb22
14. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x00000000139487e3
15. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000001395189c
16. __tsan_thread_start_func @ 0x00000000092c3418
17. ? @ 0x0000000000094ac3
18. ? @ 0x0000000000126850
(version 25.10.1.2384 (official build)). Failed to start DDLWorker.
2025.10.13 12:18:36.717780 [ 681 ] {} <Fatal> : Logical error: 'false'.
2025.10.13 12:18:36.752064 [ 681 ] {} <Fatal> : Stack trace (when copying this message, always include the lines below):
0. ./ci/tmp/build/./src/Common/StackTrace.cpp:395: StackTrace::StackTrace() @ 0x000000001385fd33
1. ./ci/tmp/build/./src/Common/Exception.cpp:57: DB::abortOnFailedAssertion(String const&) @ 0x0000000013798968
2. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1155: DB::DDLWorker::initializeMainThread() @ 0x000000001c5b693e
3. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1212: DB::DDLWorker::runMainThread() @ 0x000000001c5985fa
4. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x000000001c5bec05
5. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:149: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000001c5beb22
6. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x00000000139487e3
7. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000001395189c
8. __tsan_thread_start_func @ 0x00000000092c3418
9. ? @ 0x0000000000094ac3
10. ? @ 0x0000000000126850
```
@alesapin, the issue this test encounters appears to be related to resource problems in ZooKeeper on CI (not in our control).
The eventual goal of this test is to validate that, when test_loopback_cluster1 has a loopback host, only one replica processes the query; we have correct assertions for that.
Hi @zlareb1, can you please elaborate on the resource problems in ZK on CI? What does that mean? How is it related to the stack traces I sent? Is it related to ZooKeeper or clickhouse-keeper?
Harden tests/integration/test_ddl_worker_with_loopback_hosts/test.py against CI flakes by waiting for ZooKeeper readiness, increasing timeouts, and handling transient failures.
Reason
Two CI runs failed on the same statement:
```sql
CREATE TABLE t1 ON CLUSTER 'test_cluster' (x INT) ENGINE=MergeTree() ORDER BY x
```

- `Code: 32 ATTEMPT_TO_READ_AFTER_EOF` while ZooKeeper was repeatedly connecting/dropping.
- `Code: 159 TIMEOUT_EXCEEDED` after 10.5759 s, with the server stating the task will execute in the background.

This is a timing/race with ZK/DDL worker stabilization under CI resource jitter, not a product defect.
Closes #88328
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes