
Fix race between DDL worker and SYSTEM RESTORE DATABASE REPLICA #96117

Closed

alexey-milovidov wants to merge 1 commit into master from fix-restore-db-replica-race

Conversation

@alexey-milovidov (Member)

The old DDL worker could reconnect to ZooKeeper during `restoreDatabaseMetadataInKeeper` and read the intermediate state where table metadata had not been committed yet, causing it to mistakenly move local tables to `_broken_replicated_tables`.

The fix stops the DDL worker before restoring metadata in Keeper, so it cannot race with the restore process. `reinitializeDDLWorker` at the end of the function will start a fresh DDL worker after the metadata has been fully written.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96098&sha=a729564c75b461247048467c068bae40e8d18d18&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%205%2F5%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

clickhouse-gh bot commented Feb 5, 2026

Workflow [PR], commit [bd3ab70]

Summary:

| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (amd_binary, 5/5) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_binary, 5/5) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (arm_binary, distributed plan, 3/4) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (arm_binary, distributed plan, 3/4) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (amd_tsan, 5/6) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_tsan, 5/6) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Stress test (amd_debug) | Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6) | FAIL | cidb | |

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes a race condition between the DDL worker and the SYSTEM RESTORE DATABASE REPLICA command. The DDL worker could reconnect to ZooKeeper during metadata restoration and read an incomplete state, causing it to incorrectly move tables to the broken tables directory.

Changes:

  • Added DDL worker shutdown logic before restoring database metadata in Keeper
  • Ensured the DDL worker cannot interfere with the metadata restoration process

@tiandiwonder (Contributor)

The test has only failed twice in the last 30 days.

@alexey-milovidov (Member, Author)

Superseded by #95865, which includes this fix with an additional SCOPE_EXIT to reinitialize the DDL worker if the restore fails (otherwise a failed SYSTEM RESTORE DATABASE REPLICA leaves ddl_worker permanently null, causing segfaults on subsequent DDL queries — as observed in CI with test_failed_restore_db_replica_on_normal_replica).
