Fix race condition in DatabaseReplicated::tryEnqueueReplicatedDDL #95865
alexey-milovidov merged 8 commits into master
Conversation
The function accessed `ddl_worker->getCommonHostID()` without holding `ddl_worker_mutex` and without checking if `ddl_worker` is null. Another thread could set `ddl_worker` to nullptr during shutdown or reinitialization, causing a segmentation fault. Added proper locking and null check following the pattern used in other methods like `waitForReplicaToProcessAllEntries`. Fixes #95352 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
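The locking pattern described above can be sketched as follows. This is a simplified stand-in, not ClickHouse's actual class: the member names (`ddl_worker`, `ddl_worker_mutex`, `getCommonHostID`) follow the PR description, but the surrounding types are reduced for illustration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

// Minimal stand-in for DDLWorker, holding only the member the PR mentions.
struct DDLWorker
{
    std::string host_fqdn_id = "host-1:9000";
    std::string getCommonHostID() const { return host_fqdn_id; }
};

struct DatabaseReplicated
{
    mutable std::mutex ddl_worker_mutex;
    std::shared_ptr<DDLWorker> ddl_worker = std::make_shared<DDLWorker>();

    // Before the fix, getCommonHostID() was called with no lock and no null
    // check, so a concurrent shutdown() resetting ddl_worker could segfault.
    std::string tryEnqueueReplicatedDDL() const
    {
        std::lock_guard lock(ddl_worker_mutex);
        if (!ddl_worker)
            throw std::runtime_error("DDL worker is not initialized");
        return ddl_worker->getCommonHostID();
    }

    void shutdown()
    {
        std::lock_guard lock(ddl_worker_mutex);
        ddl_worker = nullptr;
    }
};
```

After `shutdown()`, the access throws a regular error instead of dereferencing a null pointer.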
Workflow [PR], commit [409b6a5] Summary: ❌
When we reach this point,
The original fix placed the `ddl_worker` null check at the beginning of `tryEnqueueReplicatedDDL`, which changed the error message users see when a database with a dropped replica tries to execute DDL. The test `02447_drop_database_replica` expects "Database is in readonly mode" error message in this scenario. This commit moves the `is_readonly` check to execute before the `ddl_worker` null check, preserving the expected error message while still protecting against the race condition that caused the original segfault during concurrent shutdown. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
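The check ordering this commit describes can be illustrated with a sketch (simplified stand-ins, not the real `DatabaseReplicated` code): the `is_readonly` check runs first, so a dropped replica still reports "Database is in readonly mode", and the `ddl_worker` null check only fires afterwards.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

struct DDLWorker {};

struct DatabaseReplicated
{
    bool is_readonly = false;
    std::mutex ddl_worker_mutex;
    std::unique_ptr<DDLWorker> ddl_worker = std::make_unique<DDLWorker>();

    std::string tryEnqueueReplicatedDDL()
    {
        // Checked first: preserves the error message the test
        // 02447_drop_database_replica expects for dropped replicas.
        if (is_readonly)
            throw std::runtime_error("Database is in readonly mode");

        // Checked second: protects against the concurrent-shutdown race.
        std::lock_guard lock(ddl_worker_mutex);
        if (!ddl_worker)
            throw std::runtime_error("DDL worker is not initialized");

        return "enqueued";
    }
};
```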
TLDR it's concurrent shutdown:
Maybe it is reset concurrently somehow?
This might need more investigation.
Stop the DDL worker before restoring database metadata in Keeper to prevent a race condition: the old DDL worker may reconnect to ZooKeeper, read the intermediate state (where table metadata has not been written yet), and mistakenly move local tables to `_broken_replicated_tables`. Use a `SCOPE_EXIT` to reinitialize the DDL worker if the restore fails, so that the database remains functional. Without this, a failed `SYSTEM RESTORE DATABASE REPLICA` (e.g. when the replica already has valid metadata) would leave `ddl_worker` as nullptr, causing a segfault on any subsequent DDL query. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
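The failure-path guarantee described above can be sketched with a minimal scope guard. This is a stand-in for ClickHouse's `SCOPE_EXIT` macro, and the functions (`stopDDLWorker`, `reinitializeDDLWorker`, the `simulate_failure` flag) are hypothetical simplifications of the real restore flow.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Minimal stand-in for ClickHouse's SCOPE_EXIT: runs a callback when the
// guard leaves scope, whether by normal return or by exception.
class ScopeGuard
{
    std::function<void()> on_exit;
public:
    explicit ScopeGuard(std::function<void()> f) : on_exit(std::move(f)) {}
    ~ScopeGuard() { on_exit(); }
};

bool ddl_worker_running = true;

void stopDDLWorker() { ddl_worker_running = false; }
void reinitializeDDLWorker() { ddl_worker_running = true; }

// Hypothetical, simplified shape of restoreDatabaseMetadataInKeeper:
// stop the worker first, then guarantee reinitialization on failure.
void restoreDatabaseMetadataInKeeper(bool simulate_failure)
{
    stopDDLWorker();

    bool restore_succeeded = false;
    // Without this guard, a throw below would leave the worker stopped
    // (ddl_worker as nullptr in the real code) and later DDL would crash.
    ScopeGuard guard([&] { if (!restore_succeeded) reinitializeDDLWorker(); });

    if (simulate_failure)
        throw std::runtime_error("REPLICA_ALREADY_EXISTS");

    // ... write database and table metadata to Keeper here ...
    restore_succeeded = true;
    reinitializeDDLWorker();
}
```

Either way the function exits, the worker is running again and the database stays functional.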
…/ClickHouse into fix-ddl-worker-race-condition
So the problem is that it is reset concurrently, in the case of a Replicated database.
There are failures in this code every day: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=3092b87d349f70ca8f607553c34df2fc4d53125a&name_0=MasterCI&name_1=Stress%20test%20%28amd_tsan%29&name_1=Stress%20test%20%28amd_tsan%29
We should merge asap.
At least it does not harm...
|
When restoring the DB, it creates the DB nodes (e.g., metadata, max_log_ptr) first, then creates the table metadata nodes. If creating the table metadata nodes throws an exception, the DB comes back up with empty tables. I tried to make all these operations (creating the DB node and the table metadata nodes) atomic in this PR: #94772
PR #95865 fixed the main crash in `tryEnqueueReplicatedDDL` by moving `getCommonHostID` inside the lock, but several other functions still access `ddl_worker` outside the lock after checking it, which can race with `shutdown` setting it to nullptr. Change `ddl_worker` from `unique_ptr` to `shared_ptr` and take local copies under `ddl_worker_mutex` in functions that use it after releasing the lock:

- `tryEnqueueReplicatedDDL`: `tryEnqueueAndExecuteEntry` was still called on `ddl_worker` outside the lock
- `waitForReplicaToProcessAllEntries`: all three calls outside the lock
- `tryGetReplicasInfo`: `isUnsyncedAfterRecovery` outside the lock
- `checkDigestValid`: `isCurrentlyActive` without any lock
- `dropTable`, `commitCreateTable`, `commitAlterTable`, `detachTablePermanently`, `removeDetachedPermanentlyFlag`: assert accesses that race in debug/sanitizer builds

Closes #85774

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
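The local-copy pattern this change describes can be sketched as follows (simplified stand-ins, not the real ClickHouse classes): copy the `shared_ptr` under `ddl_worker_mutex`, release the lock, then use the copy. A concurrent `shutdown()` may reset the member, but the copy keeps the worker object alive.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

struct DDLWorker
{
    bool tryEnqueueAndExecuteEntry() { return true; }
};

struct DatabaseReplicated
{
    std::mutex ddl_worker_mutex;
    std::shared_ptr<DDLWorker> ddl_worker = std::make_shared<DDLWorker>();

    bool tryEnqueueReplicatedDDL()
    {
        std::shared_ptr<DDLWorker> worker;
        {
            std::lock_guard lock(ddl_worker_mutex);
            worker = ddl_worker; // local copy taken under the lock
        }
        if (!worker)
            return false;
        // Safe even if shutdown() resets ddl_worker right now: the local
        // shared_ptr keeps the DDLWorker alive until we are done with it.
        return worker->tryEnqueueAndExecuteEntry();
    }

    void shutdown()
    {
        std::lock_guard lock(ddl_worker_mutex);
        ddl_worker.reset();
    }
};
```

With a `unique_ptr` this pattern is impossible: the member cannot be copied out, so every use would have to stay inside the critical section.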
Summary
Fixes two race conditions involving `ddl_worker` in `DatabaseReplicated`.

Issue 1: Race between DDL queries and `shutdown` (stress tests)

`tryEnqueueReplicatedDDL` accessed `ddl_worker->getCommonHostID()` without holding `ddl_worker_mutex` and without checking if `ddl_worker` is null. Another thread could set `ddl_worker` to `nullptr` during `shutdown` (e.g., when dropping the database), causing a segmentation fault at address `0x77`, the offset of the `host_fqdn_id` member from a null `DDLWorker` pointer. The fix adds proper locking and a null check following the pattern used in other methods like `waitForReplicaToProcessAllEntries`.

Root cause of the recent spike in failures: the bug existed for a long time but was invisible, because PR #95081 (merged Jan 25) fixed a critical bug in `ci/jobs/stress_job.py` where stress tests always reported SUCCESS regardless of actual failures (the code was iterating an empty shadowed variable instead of the real test results list). Once the reporting was fixed, the pre-existing crash became visible in CI, jumping from ~5/week to ~32/week.

Issue 2: Failed `SYSTEM RESTORE DATABASE REPLICA` leaves `ddl_worker` permanently null

`restoreDatabaseMetadataInKeeper` now stops the DDL worker at the beginning, to prevent a race where the old DDL worker reconnects to ZooKeeper and reads intermediate state during metadata restoration. However, if the restore failed (e.g., when the replica already has valid metadata: `REPLICA_ALREADY_EXISTS`), `reinitializeDDLWorker` was never called, leaving `ddl_worker` as `nullptr` permanently, so any subsequent DDL query on the database would segfault. The fix uses `SCOPE_EXIT` to ensure the DDL worker is always reinitialized if the restore fails, keeping the database functional.

This subsumes and supersedes PR #96117.
Closes #95352.
Changelog category (leave one):