
Fix table drop getting permanently stuck with zero-copy replication #96965

Merged: alexey-milovidov merged 2 commits into master from fix-drop-table-zero-copy-stuck on Feb 16, 2026

Conversation

@alexey-milovidov
Member

Summary

  • Fix `dropAllData` permanently blocking `DROP DATABASE` DDL when leftover part directories exist on disk with zero-copy replication enabled
  • Replace the `ZERO_COPY_REPLICATION_ERROR` exception (which caused infinite retries in `DatabaseCatalog`) with `removeSharedRecursive` using `keep_all_shared_data=true` to safely clean up local metadata while preserving shared S3 objects
  • Follows the same pattern already used for `detached/` directory cleanup and temporary directory cleanup in the same function

The issue occurs when a part directory exists on disk but isn't tracked in `data_parts_by_info` (e.g., a broken part with a corrupted `data.bin` that couldn't be loaded). The thrown exception propagates to `DatabaseCatalog::dropTablesParallel`, which retries forever because the condition never resolves, blocking subsequent DDL operations and causing cascading test failures.
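
To illustrate why the throw could never resolve on its own, here is a toy Python model of the retry behavior described above. It is only a sketch and not ClickHouse code: `run_drop_queue`, `drop_all_data_old`, `drop_all_data_new`, and the part name `all_1_1_0` are hypothetical names invented for this example, and the retry loop is bounded so that the simulation terminates.

```python
class ZeroCopyReplicationError(Exception):
    """Stand-in for the ZERO_COPY_REPLICATION_ERROR exception (toy model only)."""


def drop_all_data_old(leftover_dirs):
    # Old behavior: refuse to touch untracked part directories on a zero-copy disk.
    if leftover_dirs:
        raise ZeroCopyReplicationError(f"untracked directories: {sorted(leftover_dirs)}")


def drop_all_data_new(leftover_dirs):
    # New behavior: clean up the leftovers locally instead of throwing.
    # (Shared objects are deliberately not modeled here.)
    leftover_dirs.clear()


def run_drop_queue(drop_fn, leftover_dirs, max_attempts=5):
    """Retry the drop until it succeeds.

    Returns the attempt number that succeeded, or None if every attempt failed.
    The real catalog has no such bound; max_attempts exists only to end the simulation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            drop_fn(leftover_dirs)
            return attempt
        except ZeroCopyReplicationError:
            # Nothing on disk changes between attempts, so the next one fails too.
            continue
    return None
```

In this model, `run_drop_queue(drop_all_data_old, {"all_1_1_0"})` exhausts every attempt because the leftover directory is never removed, while the same call with `drop_all_data_new` succeeds on the first attempt.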

Closes #82676

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

🤖 Generated with Claude Code

When `dropAllData` encountered leftover part directories on disk that
were not tracked in `data_parts_by_info` (e.g., broken parts that
couldn't be loaded), it threw `ZERO_COPY_REPLICATION_ERROR` to avoid
blindly calling `removeRecursive` on a zero-copy disk. However, this
exception caused `DatabaseCatalog` to retry the drop endlessly every
few seconds, since the leftover directory never goes away on its own.
This blocked `DROP DATABASE` DDL entries from completing, causing
cascading timeout failures in concurrent tests.

Replace the throw with `removeSharedRecursive` using
`keep_all_shared_data=true`, which safely removes local metadata
while preserving shared objects (e.g., S3 data) that other replicas
may still reference. This follows the same pattern already used for
cleaning up the `detached/` directory and temporary directories in
the same function.
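
The intended effect of `keep_all_shared_data=true` can be sketched with a
toy Python model of a zero-copy disk. This is an assumption-laden
illustration, not the ClickHouse disk API: the class `ToyZeroCopyDisk`,
its `write` method, the paths, and the object keys are all hypothetical,
and only the method name and flag mirror the identifiers mentioned above.

```python
class ToyZeroCopyDisk:
    """Toy model: local metadata files point at objects in shared storage."""

    def __init__(self):
        self.local_metadata = {}     # local path -> key of the shared object
        self.shared_objects = set()  # keys in shared storage, visible to all replicas

    def write(self, path, object_key):
        # Hypothetical helper: register a local metadata file and its shared object.
        self.local_metadata[path] = object_key
        self.shared_objects.add(object_key)

    def remove_shared_recursive(self, prefix, keep_all_shared_data):
        """Remove all local metadata under prefix; keep shared objects if requested."""
        base = prefix.rstrip("/")
        doomed = [p for p in self.local_metadata
                  if p == base or p.startswith(base + "/")]
        for path in doomed:
            key = self.local_metadata.pop(path)
            if not keep_all_shared_data:
                # Only safe when no other replica can still reference the object.
                self.shared_objects.discard(key)


disk = ToyZeroCopyDisk()
disk.write("store/tbl/all_1_1_0/data.bin", "shared-key-a")
disk.write("store/tbl/all_2_2_0/data.bin", "shared-key-b")

# Clean up the leftover part directory locally, leaving shared data untouched.
disk.remove_shared_recursive("store/tbl/all_1_1_0", keep_all_shared_data=True)
```

After the call, the local entry for `all_1_1_0` is gone while both shared
keys remain, which is the property that makes the cleanup safe when other
replicas may still reference the same objects.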

Closes #82676

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 15, 2026

Workflow [PR], commit [662ff3b]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| AST fuzzer (arm_asan) | | failure | | |
| | Logical error: Unexpected return type from A. Expected B. Got C (STID: 3344-4ec5) | FAIL | cidb | |

@clickhouse-gh clickhouse-gh bot added the pr-ci label Feb 15, 2026
@alexey-milovidov alexey-milovidov self-assigned this Feb 16, 2026
@alexey-milovidov alexey-milovidov merged commit 1c3256b into master Feb 16, 2026
137 of 141 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-drop-table-zero-copy-stuck branch February 16, 2026 06:20
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 16, 2026
@CheSema CheSema assigned CheSema and unassigned alexey-milovidov Mar 16, 2026
@CheSema CheSema added the post-approved Approved, but after the PR is merged. label Mar 16, 2026


Development

Successfully merging this pull request may close these issues.

Stateless tests: "Internal query (CREATE/DROP DATABASE) failed" (DatabaseReplicated)

3 participants