
Fix race between DDL worker and SYSTEM RESTORE DATABASE REPLICA #96117

Closed

alexey-milovidov wants to merge 1 commit into master from fix-restore-db-replica-race

Conversation

@alexey-milovidov (Member)

The old DDL worker could reconnect to ZooKeeper during `restoreDatabaseMetadataInKeeper` and read the intermediate state where table metadata had not been committed yet, causing it to mistakenly move local tables to `_broken_replicated_tables`.

The fix stops the DDL worker before restoring metadata in Keeper, so it cannot race with the restore process. `reinitializeDDLWorker` at the end of the function will start a fresh DDL worker after the metadata has been fully written.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96098&sha=a729564c75b461247048467c068bae40e8d18d18&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%205%2F5%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

clickhouse-gh bot commented Feb 5, 2026

Workflow [PR], commit [bd3ab70]

Summary:

| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (amd_binary, 5/5) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_binary, 5/5) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (arm_binary, distributed plan, 3/4) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (arm_binary, distributed plan, 3/4) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Integration tests (amd_tsan, 5/6) | test_restore_db_replica/test.py::test_failed_restore_db_replica_on_normal_replica | FAIL | cidb | |
| Integration tests (amd_tsan, 5/6) | test_restore_db_replica/test.py::test_restore_db_replica_on_cluster | FAIL | cidb | |
| Stress test (amd_debug) | Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6) | FAIL | cidb | |

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes a race condition between the DDL worker and the SYSTEM RESTORE DATABASE REPLICA command. The DDL worker could reconnect to ZooKeeper during metadata restoration and read an incomplete state, causing it to incorrectly move tables to the broken tables directory.

Changes:

  • Added DDL worker shutdown logic before restoring database metadata in Keeper
  • Ensured the DDL worker cannot interfere with the metadata restoration process

@tiandiwonder (Contributor)

The test has only failed twice in the last 30 days.

@alexey-milovidov (Member, Author)

Superseded by #95865, which includes this fix with an additional SCOPE_EXIT to reinitialize the DDL worker if the restore fails (otherwise a failed SYSTEM RESTORE DATABASE REPLICA leaves ddl_worker permanently null, causing segfaults on subsequent DDL queries — as observed in CI with test_failed_restore_db_replica_on_normal_replica).
