Fix race condition in DatabaseReplicated::tryEnqueueReplicatedDDL #95865
alexey-milovidov merged 8 commits into master
Conversation
The function accessed `ddl_worker->getCommonHostID()` without holding `ddl_worker_mutex` and without checking if `ddl_worker` is null. Another thread could set `ddl_worker` to nullptr during shutdown or reinitialization, causing a segmentation fault. Added proper locking and null check following the pattern used in other methods like `waitForReplicaToProcessAllEntries`. Fixes #95352 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
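The locking pattern described above can be sketched as follows. This is a simplified stand-in, not ClickHouse's actual class: the member names (`ddl_worker`, `ddl_worker_mutex`, `getCommonHostID`) follow the PR description, but the surrounding types are reduced for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

// Minimal stand-in for DDLWorker, holding only the member the PR mentions.
struct DDLWorker
{
    std::string host_fqdn_id = "host-1:9000";
    std::string getCommonHostID() const { return host_fqdn_id; }
};

struct DatabaseReplicated
{
    mutable std::mutex ddl_worker_mutex;
    std::shared_ptr<DDLWorker> ddl_worker = std::make_shared<DDLWorker>();

    // Before the fix, getCommonHostID() was called with no lock and no null
    // check, so a concurrent shutdown() resetting ddl_worker could segfault.
    std::string tryEnqueueReplicatedDDL() const
    {
        std::lock_guard lock(ddl_worker_mutex);
        if (!ddl_worker)
            throw std::runtime_error("DDL worker is not initialized");
        return ddl_worker->getCommonHostID();
    }

    void shutdown()
    {
        std::lock_guard lock(ddl_worker_mutex);
        ddl_worker = nullptr;
    }
};
```

After `shutdown()`, the access throws a regular error instead of dereferencing a null pointer.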
Workflow [PR], commit [409b6a5] Summary: ❌
When we reach this point,
The original fix placed the `ddl_worker` null check at the beginning of `tryEnqueueReplicatedDDL`, which changed the error message users see when a database with a dropped replica tries to execute DDL. The test `02447_drop_database_replica` expects "Database is in readonly mode" error message in this scenario. This commit moves the `is_readonly` check to execute before the `ddl_worker` null check, preserving the expected error message while still protecting against the race condition that caused the original segfault during concurrent shutdown. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
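The check ordering this commit describes can be illustrated with a sketch (simplified stand-ins, not the real `DatabaseReplicated` code): the `is_readonly` check runs first, so a dropped replica still reports "Database is in readonly mode", and the `ddl_worker` null check only fires afterwards.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

struct DDLWorker {};

struct DatabaseReplicated
{
    bool is_readonly = false;
    std::mutex ddl_worker_mutex;
    std::unique_ptr<DDLWorker> ddl_worker = std::make_unique<DDLWorker>();

    std::string tryEnqueueReplicatedDDL()
    {
        // Checked first: preserves the error message the test
        // 02447_drop_database_replica expects for dropped replicas.
        if (is_readonly)
            throw std::runtime_error("Database is in readonly mode");

        // Checked second: protects against the concurrent-shutdown race.
        std::lock_guard lock(ddl_worker_mutex);
        if (!ddl_worker)
            throw std::runtime_error("DDL worker is not initialized");

        return "enqueued";
    }
};
```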
TLDR it's concurrent shutdown:
Maybe it is reset concurrently somehow?
This might need more investigation.
Stop the DDL worker before restoring database metadata in Keeper to prevent a race condition: the old DDL worker may reconnect to ZooKeeper, read the intermediate state (where table metadata has not been written yet), and mistakenly move local tables to `_broken_replicated_tables`. Use a `SCOPE_EXIT` to reinitialize the DDL worker if the restore fails, so that the database remains functional. Without this, a failed `SYSTEM RESTORE DATABASE REPLICA` (e.g. when the replica already has valid metadata) would leave `ddl_worker` as nullptr, causing a segfault on any subsequent DDL query. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
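The failure-path guarantee described above can be sketched with a minimal scope guard. This is a stand-in for ClickHouse's `SCOPE_EXIT` macro, and the functions (`stopDDLWorker`, `reinitializeDDLWorker`, the `simulate_failure` flag) are hypothetical simplifications of the real restore flow.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Minimal stand-in for ClickHouse's SCOPE_EXIT: runs a callback when the
// guard leaves scope, whether by normal return or by exception.
class ScopeGuard
{
    std::function<void()> on_exit;
public:
    explicit ScopeGuard(std::function<void()> f) : on_exit(std::move(f)) {}
    ~ScopeGuard() { on_exit(); }
};

bool ddl_worker_running = true;

void stopDDLWorker() { ddl_worker_running = false; }
void reinitializeDDLWorker() { ddl_worker_running = true; }

// Hypothetical, simplified shape of restoreDatabaseMetadataInKeeper:
// stop the worker first, then guarantee reinitialization on failure.
void restoreDatabaseMetadataInKeeper(bool simulate_failure)
{
    stopDDLWorker();

    bool restore_succeeded = false;
    // Without this guard, a throw below would leave the worker stopped
    // (ddl_worker as nullptr in the real code) and later DDL would crash.
    ScopeGuard guard([&] { if (!restore_succeeded) reinitializeDDLWorker(); });

    if (simulate_failure)
        throw std::runtime_error("REPLICA_ALREADY_EXISTS");

    // ... write database and table metadata to Keeper here ...
    restore_succeeded = true;
    reinitializeDDLWorker();
}
```

Either way the function exits, the worker is running again and the database stays functional.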
…/ClickHouse into fix-ddl-worker-race-condition
So the problem is that it is reset concurrently, in the case of a Replicated database.
There are failures in this code every day: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3092b87d349f70ca8f607553c34df2fc4d53125a&name_0=MasterCI&name_1=Stress%20test%20%28amd_tsan%29&name_1=Stress%20test%20%28amd_tsan%29
We should merge asap.
At least it does not harm...
|
When restoring the DB, it creates the DB nodes (e.g., metadata, max_log_ptr) first, then creates the table metadata nodes. If creating the table metadata nodes throws an exception, the DB comes back up with empty tables. I tried to make all these operations (creating the DB node and the table metadata nodes) atomic in this PR: #94772
PR #95865 fixed the main crash in `tryEnqueueReplicatedDDL` by moving `getCommonHostID` inside the lock, but several other functions still access `ddl_worker` outside the lock after checking it, which can race with `shutdown` setting it to nullptr. Change `ddl_worker` from `unique_ptr` to `shared_ptr` and take local copies under `ddl_worker_mutex` in functions that use it after releasing the lock:

- `tryEnqueueReplicatedDDL`: `tryEnqueueAndExecuteEntry` was still called on `ddl_worker` outside the lock
- `waitForReplicaToProcessAllEntries`: all three calls outside the lock
- `tryGetReplicasInfo`: `isUnsyncedAfterRecovery` outside the lock
- `checkDigestValid`: `isCurrentlyActive` without any lock
- `dropTable`, `commitCreateTable`, `commitAlterTable`, `detachTablePermanently`, `removeDetachedPermanentlyFlag`: assert accesses that race in debug/sanitizer builds

Closes #85774

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
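The local-copy pattern this change describes can be sketched as follows (simplified stand-ins, not the real ClickHouse classes): copy the `shared_ptr` under `ddl_worker_mutex`, release the lock, then use the copy. A concurrent `shutdown()` may reset the member, but the copy keeps the worker object alive.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

struct DDLWorker
{
    bool tryEnqueueAndExecuteEntry() { return true; }
};

struct DatabaseReplicated
{
    std::mutex ddl_worker_mutex;
    std::shared_ptr<DDLWorker> ddl_worker = std::make_shared<DDLWorker>();

    bool tryEnqueueReplicatedDDL()
    {
        std::shared_ptr<DDLWorker> worker;
        {
            std::lock_guard lock(ddl_worker_mutex);
            worker = ddl_worker; // local copy taken under the lock
        }
        if (!worker)
            return false;
        // Safe even if shutdown() resets ddl_worker right now: the local
        // shared_ptr keeps the DDLWorker alive until we are done with it.
        return worker->tryEnqueueAndExecuteEntry();
    }

    void shutdown()
    {
        std::lock_guard lock(ddl_worker_mutex);
        ddl_worker.reset();
    }
};
```

With a `unique_ptr` this pattern is impossible: the member cannot be copied out, so every use would have to stay inside the critical section.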
Summary
Fixes two race conditions involving `ddl_worker` in `DatabaseReplicated`.

Issue 1: Race between DDL queries and `shutdown` (stress tests)

`tryEnqueueReplicatedDDL` accessed `ddl_worker->getCommonHostID()` without holding `ddl_worker_mutex` and without checking if `ddl_worker` is null. Another thread could set `ddl_worker` to `nullptr` during `shutdown` (e.g., when dropping the database), causing a segmentation fault at address `0x77`, the offset of the `host_fqdn_id` member from a null `DDLWorker` pointer. The fix adds proper locking and a null check following the pattern used in other methods like `waitForReplicaToProcessAllEntries`.

Root cause of the recent spike in failures: the bug existed for a long time but was invisible, because PR #95081 (merged Jan 25) fixed a critical bug in `ci/jobs/stress_job.py` where stress tests always reported SUCCESS regardless of actual failures (the code was iterating an empty shadowed variable instead of the real test results list). Once the reporting was fixed, the pre-existing crash became visible in CI, jumping from ~5/week to ~32/week.

Issue 2: Failed `SYSTEM RESTORE DATABASE REPLICA` leaves `ddl_worker` permanently null

`restoreDatabaseMetadataInKeeper` now stops the DDL worker at the beginning, to prevent a race where the old DDL worker reconnects to ZooKeeper and reads intermediate state during metadata restoration. However, if the restore failed (e.g., when the replica already has valid metadata: `REPLICA_ALREADY_EXISTS`), `reinitializeDDLWorker` was never called, leaving `ddl_worker` as `nullptr` permanently, so any subsequent DDL query on the database would segfault. The fix uses `SCOPE_EXIT` to ensure the DDL worker is always reinitialized if the restore fails, keeping the database functional.

This subsumes and supersedes PR #96117.
Closes #95352.
Changelog category (leave one):