
Harden test_ddl_worker_with_loopback_hosts against CI flakes#88342

Merged
alexey-milovidov merged 1 commit into ClickHouse:master from zlareb1:fix/test_ddl_worker_with_loopback_hosts
Oct 11, 2025

Conversation

@zlareb1 (Member) commented Oct 10, 2025

Harden tests/integration/test_ddl_worker_with_loopback_hosts/test.py against CI flakes by:

  • Waiting for ZooKeeper readiness on both nodes before the first ON CLUSTER query.
  • Increasing distributed_ddl_task_timeout from 10 s to 60 s.
  • Treating Code: 159 (task continues in the background) and Code: 32 (early EOF) as transient and polling for the expected end state.
  • Adding a single idempotent IF NOT EXISTS retry.
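The steps above could be sketched roughly as follows. Helper names, intervals, and the exception-message format are illustrative assumptions, not the test's actual code:

```python
import time

# "Code: 159" is TIMEOUT_EXCEEDED, "Code: 32" is ATTEMPT_TO_READ_AFTER_EOF.
TRANSIENT_MARKERS = ("Code: 159", "Code: 32")
POLL_INTERVAL = 0.5
DEADLINE = 60  # seconds, mirroring the raised distributed_ddl_task_timeout

def run_ddl_with_retry(query_fn, check_fn, create_sql):
    """Run a DDL statement, treating known transient errors as retryable
    and polling check_fn() until the expected end state is reached."""
    try:
        query_fn(create_sql)
    except Exception as e:
        # Only swallow the errors considered transient under CI jitter.
        if not any(marker in str(e) for marker in TRANSIENT_MARKERS):
            raise
    deadline = time.monotonic() + DEADLINE
    while time.monotonic() < deadline:
        if check_fn():
            return
        time.sleep(POLL_INTERVAL)
    # Last-chance idempotent nudge to converge.
    if create_sql.startswith("CREATE TABLE "):
        query_fn(create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1))
```

The IF NOT EXISTS rewrite makes the final retry safe even when the original CREATE actually completed in the background after the timeout.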

Reason
Two CI runs failed on the same statement:
CREATE TABLE t1 ON CLUSTER 'test_cluster' (x INT) ENGINE=MergeTree() ORDER BY x

  • Run A: Code: 32 ATTEMPT_TO_READ_AFTER_EOF while ZooKeeper was repeatedly connecting/dropping.
  • Run B: Code: 159 TIMEOUT_EXCEEDED after 10.5759s with the server stating the task will execute in the background.

This is a timing/race with ZK/DDL worker stabilization under CI resource jitter, not a product defect.

Closes #88328

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

…ng for ZooKeeper readiness, increasing timeouts & handling transient failures

clickhouse-gh bot commented Oct 10, 2025

Workflow [PR], commit [ff1e810]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Oct 10, 2025
@george-larionov george-larionov self-assigned this Oct 10, 2025
return
time.sleep(POLL_INTERVAL)

# Last-chance idempotent nudge to converge
@george-larionov (Member) Oct 10, 2025
Question: Why add this final try? Is the above not good enough? Nevermind, I misunderstood the test

@george-larionov (Member) left a comment
LGTM

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Oct 11, 2025
Merged via the queue into ClickHouse:master with commit e423373 Oct 11, 2025
123 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 11, 2025
Comment on lines +82 to +83
"Code: 159",
"Code: 32",
Member:
Better to use human names instead of codes

Member Author:
However, the codes won't change, making it more reliable for the test.
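For comparison, the reviewer's suggestion could be sketched as matching the symbolic names the server prints next to the numeric codes. The marker tuple below is an assumption about the error-message format, not code from this PR:

```python
# Symbolic names ClickHouse prints alongside the numeric codes:
# 159 is TIMEOUT_EXCEEDED, 32 is ATTEMPT_TO_READ_AFTER_EOF.
TRANSIENT_ERRORS = ("TIMEOUT_EXCEEDED", "ATTEMPT_TO_READ_AFTER_EOF")

def is_transient(error_text: str) -> bool:
    """Return True if the error text mentions a known transient error."""
    return any(name in error_text for name in TRANSIENT_ERRORS)
```

Matching by name keeps the intent readable; matching by code keeps the check stable if the message wording ever changes, which is the trade-off in this exchange.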

Member:
@zlareb1, are you sure?


# Last-chance idempotent nudge to converge
if create_sql.startswith("CREATE TABLE "):
safe_sql = create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1)
Member:
Maybe we should just modify the original queries?

Member Author:
If the query is modified, we must also update check_fn accordingly.


azat commented Oct 13, 2025

Increased distributed_ddl_task_timeout from 10 to 60s.

Default value is 180


azat commented Oct 13, 2025

Run A: Code: 32 ATTEMPT_TO_READ_AFTER_EOF while ZooKeeper was repeatedly connecting/dropping.

But why this happens?

zlareb1 commented Oct 13, 2025

Increased distributed_ddl_task_timeout from 10 to 60s.

Default value is 180

For this test, we used to explicitly set a 10-second deadline timeout.

zlareb1 commented Oct 13, 2025

Run A: Code: 32 ATTEMPT_TO_READ_AFTER_EOF while ZooKeeper was repeatedly connecting/dropping.

But why this happens?

Code: 32 was caused by the client losing its connection while ZooKeeper and the DDL worker were still stabilizing after startup. We now wait for ZooKeeper readiness before running the first ON CLUSTER query and poll for convergence, so this transient EOF should no longer occur.
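The readiness wait described here could look roughly like the following. The helper name and the probe query against system.zookeeper are assumptions for illustration, not the test's real code:

```python
import time

def wait_zookeeper_ready(node, timeout=60, poll=1.0):
    """Poll a node until a query touching ZooKeeper succeeds,
    proving the session is established before the first ON CLUSTER."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            node.query("SELECT * FROM system.zookeeper WHERE path = '/'")
            return
        except Exception:
            time.sleep(poll)
    raise TimeoutError(f"ZooKeeper not ready on {node.name} within {timeout}s")
```

Running this against both nodes before the first DDL removes the startup race without masking later failures.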

# Last-chance idempotent nudge to converge
if create_sql.startswith("CREATE TABLE "):
safe_sql = create_sql.replace("CREATE TABLE ", "CREATE TABLE IF NOT EXISTS ", 1)
node_issuer.query(
Member:
@zlareb1 how does retrying a modified query help? You could just as well run the same query in a loop. This change breaks the test and hides a real bug.

2025.10.13 12:18:36.717663 [ 681 ] {} <Error> DDLWorker: ZooKeeper error: Code: 999. Coordination::Exception: Transaction failed (Node exists): Op #0, path: /clickhouse/task_queue/replicas/127%2E0%2E0%2E1:9000/active. (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

0. ./contrib/llvm-project/libcxx/include/__exception/exception.h:113: Poco::Exception::Exception(String const&, int) @ 0x0000000029f56fa0
1. ./ci/tmp/build/./src/Common/Exception.cpp:129: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000013799574
2. DB::Exception::Exception(String&&, int, String, bool) @ 0x00000000093559ee
3. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x00000000093553f0
4. ./src/Common/Exception.h:141: DB::Exception::Exception<Coordination::Error&, unsigned long&>(int, FormatStringHelperImpl<std::type_identity<Coordination::Error&>::type, std::type_identity<unsigned long&>::type>, Coordination::Error&, unsigned long&) @ 0x0000000023642906
5. ./src/Common/ZooKeeper/KeeperException.h:38: zkutil::KeeperMultiException::KeeperMultiException(Coordination::Error, unsigned long, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x0000000023628861
6. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:1610: zkutil::KeeperMultiException::KeeperMultiException(Coordination::Error, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x0000000023628d7f
7. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:1627: zkutil::KeeperMultiException::check(Coordination::Error, std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>> const&) @ 0x000000002361a59d
8. ./ci/tmp/build/./src/Common/ZooKeeper/ZooKeeper.cpp:748: zkutil::ZooKeeper::multi(std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, bool) @ 0x000000002361a435
9. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1394: DB::DDLWorker::markReplicasActive(bool) @ 0x000000001c5bb8b4
10. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1145: DB::DDLWorker::initializeMainThread() @ 0x000000001c5b6457
11. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1212: DB::DDLWorker::runMainThread() @ 0x000000001c5985fa
12. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x000000001c5bec05
13. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:149: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000001c5beb22
14. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x00000000139487e3
15. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000001395189c
16. __tsan_thread_start_func @ 0x00000000092c3418
17. ? @ 0x0000000000094ac3
18. ? @ 0x0000000000126850
 (version 25.10.1.2384 (official build)). Failed to start DDLWorker.
2025.10.13 12:18:36.717780 [ 681 ] {} <Fatal> : Logical error: 'false'.
2025.10.13 12:18:36.752064 [ 681 ] {} <Fatal> : Stack trace (when copying this message, always include the lines below):

0. ./ci/tmp/build/./src/Common/StackTrace.cpp:395: StackTrace::StackTrace() @ 0x000000001385fd33
1. ./ci/tmp/build/./src/Common/Exception.cpp:57: DB::abortOnFailedAssertion(String const&) @ 0x0000000013798968
2. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1155: DB::DDLWorker::initializeMainThread() @ 0x000000001c5b693e
3. ./ci/tmp/build/./src/Interpreters/DDLWorker.cpp:1212: DB::DDLWorker::runMainThread() @ 0x000000001c5985fa
4. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0x000000001c5bec05
5. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:149: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000001c5beb22
6. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x00000000139487e3
7. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000001395189c
8. __tsan_thread_start_func @ 0x00000000092c3418
9. ? @ 0x0000000000094ac3
10. ? @ 0x0000000000126850

Member Author:
@alesapin, the issue this test encounters appears to be related to resource problems in ZK on CI (not in our control).
The eventual goal of this test is to validate that test_loopback_cluster1 has a loopback host and that only one replica processes the query, for which we have correct assertions.

Member:
Hi @zlareb1, can you please elaborate on the resource problems in ZK on CI? What does that mean? How is it related to the stack traces I sent? Is it related to ZooKeeper or clickhouse-keeper?


Labels

  • pr-not-for-changelog: This PR should not be mentioned in the changelog
  • pr-synced-to-cloud: The PR is synced to the cloud repo


Development

Successfully merging this pull request may close these issues.

Flaky test ddl_worker_with_loopback_hosts

7 participants