
Fix race condition in DatabaseReplicated::tryEnqueueReplicatedDDL #95865

Merged
alexey-milovidov merged 8 commits into master from fix-ddl-worker-race-condition
Feb 8, 2026

Conversation

alexey-milovidov (Member) commented Feb 2, 2026

Summary

Fixes two race conditions involving ddl_worker in DatabaseReplicated:

Issue 1: Race between DDL queries and shutdown (stress tests)

tryEnqueueReplicatedDDL accessed ddl_worker->getCommonHostID() without holding ddl_worker_mutex and without checking if ddl_worker is null. Another thread could set ddl_worker to nullptr during shutdown (e.g., when dropping the database), causing a segmentation fault at address 0x77 — the offset of the host_fqdn_id member from a null DDLWorker pointer.

The fix adds proper locking and a null check following the pattern used in other methods like waitForReplicaToProcessAllEntries.
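The locking pattern described above can be sketched in a minimal, self-contained model. The `Database` struct, the error text, and the return value (the real `tryEnqueueReplicatedDDL` does more than fetch the host ID) are illustrative stand-ins for the actual ClickHouse classes, not their real interfaces:

```cpp
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

// Illustrative stand-in for the real DDLWorker class.
struct DDLWorker
{
    std::string getCommonHostID() const { return "host:9000"; }
};

// Illustrative stand-in for DatabaseReplicated.
struct Database
{
    std::mutex ddl_worker_mutex;
    std::unique_ptr<DDLWorker> ddl_worker;

    // Before the fix, ddl_worker->getCommonHostID() was called with no lock
    // and no null check, so a concurrent shutdown() resetting ddl_worker
    // dereferenced a null pointer (the crash at offset 0x77).
    std::string tryEnqueueReplicatedDDL()
    {
        std::lock_guard lock(ddl_worker_mutex);   // hold the mutex first
        if (!ddl_worker)                          // worker may be gone after shutdown()
            throw std::runtime_error("DDL worker is not initialized");
        return ddl_worker->getCommonHostID();
    }

    void shutdown()
    {
        std::lock_guard lock(ddl_worker_mutex);
        ddl_worker.reset();                       // sets ddl_worker to nullptr
    }
};
```

With the mutex held across both the null check and the access, shutdown() can no longer reset the worker between the two steps.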

Root cause of the recent spike in failures: The bug existed for a long time but was invisible because PR #95081 (merged Jan 25) fixed a critical bug in ci/jobs/stress_job.py where stress tests were always reporting SUCCESS regardless of actual failures — the code was iterating an empty shadowed variable instead of the real test results list. Once the reporting was fixed, the pre-existing crash became visible in CI, jumping from ~5/week to ~32/week.

Issue 2: Failed SYSTEM RESTORE DATABASE REPLICA leaves ddl_worker permanently null

restoreDatabaseMetadataInKeeper now stops the DDL worker at the beginning (to prevent a race where the old DDL worker reconnects to ZooKeeper and reads intermediate state during metadata restoration). However, if the restore fails (e.g., when the replica already has valid metadata — REPLICA_ALREADY_EXISTS), reinitializeDDLWorker was never called, leaving ddl_worker as nullptr permanently. Any subsequent DDL query on the database would segfault.

The fix uses SCOPE_EXIT to ensure the DDL worker is always reinitialized if the restore fails, keeping the database functional.
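The SCOPE_EXIT idea can be sketched with a hand-rolled scope guard standing in for ClickHouse's actual `SCOPE_EXIT` macro. The function names and bodies below are illustrative placeholders, not the real implementation:

```cpp
#include <functional>
#include <stdexcept>

// Minimal scope guard: runs its callback when the enclosing scope exits,
// whether normally or via an exception (standing in for SCOPE_EXIT).
struct ScopeGuard
{
    std::function<void()> on_exit;
    ~ScopeGuard() { if (on_exit) on_exit(); }
};

bool ddl_worker_running = true;   // illustrative stand-in for the ddl_worker state

void stopDDLWorker()         { ddl_worker_running = false; }
void reinitializeDDLWorker() { ddl_worker_running = true; }

void restoreDatabaseMetadataInKeeper(bool simulate_failure)
{
    // Stop the worker so it cannot read intermediate Keeper state.
    stopDDLWorker();

    // Runs on both the success and failure paths, so a failed restore
    // (e.g. REPLICA_ALREADY_EXISTS) no longer leaves ddl_worker null.
    ScopeGuard guard{[] { reinitializeDDLWorker(); }};

    if (simulate_failure)
        throw std::runtime_error("REPLICA_ALREADY_EXISTS");
    // ... write metadata to Keeper ...
}
```

Because the guard's destructor runs during stack unwinding, the worker is brought back even when the restore throws partway through.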

This subsumes and supersedes PR #96117.

Closes #95352.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

The function accessed `ddl_worker->getCommonHostID()` without holding
`ddl_worker_mutex` and without checking if `ddl_worker` is null.
Another thread could set `ddl_worker` to nullptr during shutdown or
reinitialization, causing a segmentation fault.

Added proper locking and null check following the pattern used in other
methods like `waitForReplicaToProcessAllEntries`.

Fixes #95352

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
clickhouse-gh bot (Contributor) commented Feb 2, 2026

Workflow [PR], commit [409b6a5]

Summary:

| job_name | test_name | status | info/comment |
|---|---|---|---|
| Stress test (amd_ubsan), failure | Logical error: Unexpected return type from A. Expected B. Got C. (STID: 1611-3d19) | FAIL | cidb, issue |

clickhouse-gh bot added the pr-ci label Feb 2, 2026
tavplubix (Member) commented:

> The function accessed `ddl_worker->getCommonHostID()` without holding `ddl_worker_mutex` and without checking if `ddl_worker` is null

When we reach this point, ddl_worker must already be initialized, and it shouldn't be reset concurrently. So I'm wondering how this happened.

alexey-milovidov marked this pull request as draft February 3, 2026 22:32
alexey-milovidov and others added 2 commits February 5, 2026 06:17
The original fix placed the `ddl_worker` null check at the beginning of
`tryEnqueueReplicatedDDL`, which changed the error message users see when
a database with a dropped replica tries to execute DDL. The test
`02447_drop_database_replica` expects "Database is in readonly mode"
error message in this scenario.

This commit moves the `is_readonly` check to execute before the
`ddl_worker` null check, preserving the expected error message while
still protecting against the race condition that caused the original
segfault during concurrent shutdown.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
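The check re-ordering in this commit can be sketched as follows. The names and error texts are illustrative, and treating `is_readonly` as readable under the same mutex is an assumption made for this sketch, not something confirmed by the real code:

```cpp
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

struct DDLWorker {};

std::mutex ddl_worker_mutex;
std::unique_ptr<DDLWorker> ddl_worker;   // nullptr: simulates a dropped replica
bool is_readonly = false;

// Check is_readonly before the ddl_worker null check, so a database with a
// dropped replica still reports "Database is in readonly mode" (as the test
// 02447_drop_database_replica expects) rather than a worker-not-initialized
// error, while the null check still guards against the shutdown race.
void tryEnqueueReplicatedDDL()
{
    std::lock_guard lock(ddl_worker_mutex);
    if (is_readonly)
        throw std::runtime_error("Database is in readonly mode");
    if (!ddl_worker)
        throw std::runtime_error("DDL worker is not initialized");
    // ... enqueue the DDL entry ...
}
```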
alexey-milovidov (Member, Author) commented:

TLDR it's concurrent shutdown:

  1. waitDatabaseStarted only ensures the initial startup task completes - it doesn't prevent shutdown() from running concurrently after that point
  2. The race condition is:
    - Thread A calls tryEnqueueReplicatedDDL, passes waitDatabaseStarted
    - Thread B calls shutdown() (e.g., during database drop or stress testing)
    - Thread B acquires ddl_worker_mutex, sets ddl_worker = nullptr
    - Thread A tries to access ddl_worker->getCommonHostID() → crash
  3. The crash was observed in a stress test (issue #95352: Segmentation fault (STID: 2704-3fd4)), where concurrent operations during shutdown/restart are more likely

The fix adds mutex protection following the same pattern used in waitForReplicaToProcessAllEntries.

alexey-milovidov (Member, Author) commented:

> shouldn't be reset concurrently

Maybe it is reset concurrently somehow?

alexey-milovidov marked this pull request as ready for review February 5, 2026 05:33
alexey-milovidov (Member, Author) commented:

This might need more investigation.

alexey-milovidov and others added 3 commits February 6, 2026 04:06
Stop the DDL worker before restoring database metadata in Keeper
to prevent a race condition: the old DDL worker may reconnect to
ZooKeeper, read the intermediate state (where table metadata has
not been written yet), and mistakenly move local tables to
`_broken_replicated_tables`.

Use a `SCOPE_EXIT` to reinitialize the DDL worker if the restore
fails, so that the database remains functional. Without this, a
failed `SYSTEM RESTORE DATABASE REPLICA` (e.g. when the replica
already has valid metadata) would leave `ddl_worker` as nullptr,
causing a segfault on any subsequent DDL query.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…/ClickHouse into fix-ddl-worker-race-condition
alexey-milovidov (Member, Author) commented:

So the problem is that it is reset concurrently, in the case of Replicated database.

alexey-milovidov self-assigned this Feb 8, 2026

alexey-milovidov (Member, Author) commented:

At least it does not harm...

alexey-milovidov merged commit 6a8e042 into master Feb 8, 2026
131 of 134 checks passed
alexey-milovidov deleted the fix-ddl-worker-race-condition branch February 8, 2026 08:25
robot-clickhouse added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Feb 8, 2026
tuanpach (Member) commented:

When restoring the DB, it creates the DB nodes (e.g., metadata, max_log_ptr) first, then creates the table metadata nodes later. However, after the metadata node is created and before the table metadata nodes are created, the other nodes might see that the metadata on Keeper is empty, and move all tables to the broken database.

If creating the table metadata nodes throws an exception, the DB will come back up with empty tables.

I tried to make all the operations (creating the DB node and the table metadata nodes) atomic in this PR: #94772

alexey-milovidov added a commit that referenced this pull request Feb 15, 2026
PR #95865 fixed the main crash in `tryEnqueueReplicatedDDL` by moving
`getCommonHostID` inside the lock, but several other functions still
access `ddl_worker` outside the lock after checking it, which can race
with `shutdown` setting it to nullptr.

Change `ddl_worker` from `unique_ptr` to `shared_ptr` and take local
copies under `ddl_worker_mutex` in functions that use it after releasing
the lock:

- `tryEnqueueReplicatedDDL`: `tryEnqueueAndExecuteEntry` was still
  called on `ddl_worker` outside the lock
- `waitForReplicaToProcessAllEntries`: all three calls outside the lock
- `tryGetReplicasInfo`: `isUnsyncedAfterRecovery` outside the lock
- `checkDigestValid`: `isCurrentlyActive` without any lock
- `dropTable`, `commitCreateTable`, `commitAlterTable`,
  `detachTablePermanently`, `removeDetachedPermanentlyFlag`: assert
  accesses that race in debug/sanitizer builds

Closes #85774

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
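The local-copy pattern from this follow-up commit can be sketched as follows, with simplified, free-standing stand-ins for the real members. Taking a `shared_ptr` copy under the mutex keeps the worker alive even if `shutdown` resets the shared member while the copy is still in use:

```cpp
#include <memory>
#include <mutex>

// Illustrative stand-in for the real DDLWorker class.
struct DDLWorker
{
    bool isCurrentlyActive() const { return true; }
};

std::mutex ddl_worker_mutex;
std::shared_ptr<DDLWorker> ddl_worker = std::make_shared<DDLWorker>();

// Take a local copy under the mutex, then release the lock; the copy's
// reference count keeps the object alive even if another thread resets
// the shared member concurrently.
bool checkWorkerActive()
{
    std::shared_ptr<DDLWorker> worker;
    {
        std::lock_guard lock(ddl_worker_mutex);
        worker = ddl_worker;
    }
    if (!worker)
        return false;                    // worker already shut down
    return worker->isCurrentlyActive();  // safe: local copy owns the object
}
```

This is why the commit switches the member from `unique_ptr` to `shared_ptr`: a `unique_ptr` cannot be copied out from under the lock, so every use would otherwise have to hold the mutex for its full duration.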
