Fix race between DDL worker and SYSTEM RESTORE DATABASE REPLICA #96117
Closed
alexey-milovidov wants to merge 1 commit into master
The old DDL worker could reconnect to ZooKeeper during `restoreDatabaseMetadataInKeeper` and read the intermediate state where table metadata had not been committed yet, causing it to mistakenly move local tables to `_broken_replicated_tables`.

The fix stops the DDL worker before restoring metadata in Keeper, so it cannot race with the restore process. `reinitializeDDLWorker` at the end of the function will start a fresh DDL worker after the metadata has been fully written.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96098&sha=a729564c75b461247048467c068bae40e8d18d18&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%205%2F5%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Workflow [PR], commit [bd3ab70] Summary: ❌
Contributor
Pull request overview
This PR fixes a race condition between the DDL worker and the SYSTEM RESTORE DATABASE REPLICA command. The DDL worker could reconnect to ZooKeeper during metadata restoration and read an incomplete state, causing it to incorrectly move tables to the broken tables directory.
Changes:
- Added DDL worker shutdown logic before restoring database metadata in Keeper
- Ensured the DDL worker cannot interfere with the metadata restoration process
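The ordering described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the ClickHouse implementation: `MockKeeper`, `MockDDLWorker`, `reconcile`, and `restoreReplica` are hypothetical names invented for illustration, and the "reconnect" is simulated by calling the old worker once while Keeper still holds no committed table metadata.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the table metadata stored in Keeper.
struct MockKeeper
{
    std::set<std::string> committed_tables;
};

// Hypothetical stand-in for the DDL worker. While active, it compares local
// tables with Keeper and marks every table it cannot find there as broken.
struct MockDDLWorker
{
    bool active = true;
    std::vector<std::string> broken_tables;

    void shutdown() { active = false; }

    void reconcile(const MockKeeper & keeper, const std::vector<std::string> & local_tables)
    {
        if (!active)
            return; // A stopped worker cannot observe the intermediate state.
        for (const auto & table : local_tables)
            if (keeper.committed_tables.count(table) == 0)
                broken_tables.push_back(table);
    }
};

// Models the restore sequence and returns the tables that ended up marked
// as broken. `stop_worker_first` toggles the ordering proposed in this PR.
std::vector<std::string> restoreReplica(bool stop_worker_first, const std::vector<std::string> & local_tables)
{
    MockKeeper keeper;
    MockDDLWorker old_worker;

    if (stop_worker_first)
        old_worker.shutdown();

    // Intermediate state: metadata is not committed yet. A simulated
    // reconnect lets the old worker (if still active) look at Keeper here.
    old_worker.reconcile(keeper, local_tables);

    // The restore finishes writing all table metadata.
    for (const auto & table : local_tables)
        keeper.committed_tables.insert(table);

    // A fresh worker starts only after the metadata is fully written.
    MockDDLWorker fresh_worker;
    fresh_worker.reconcile(keeper, local_tables);

    std::vector<std::string> broken = old_worker.broken_tables;
    broken.insert(broken.end(), fresh_worker.broken_tables.begin(), fresh_worker.broken_tables.end());
    return broken;
}
```

In this model, calling `restoreReplica(false, tables)` marks every local table as broken, because the old worker sees an empty Keeper, while `restoreReplica(true, tables)` marks none. The real race depends on reconnect timing, which the model replaces with a fixed call order to keep the example deterministic.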
Contributor
The test has failed only twice in the last 30 days.
Member
Author
Superseded by #95865, which includes this fix with an additional