Fix heap-use-after-free in MergeTreeReadTask::createReaders#99483
Conversation
`MergeTreeReadTask::createReaders` accessed `data_part->storage.getContext()` (via `LoadedMergeTreeDataPartInfoForReader` constructor) and `data_part->storage.getSettings()` through a potentially dangling reference. When a table is dropped by `DatabaseCatalog::dropTablesParallel` while a query is still reading from it, the `StorageMergeTree` object is destroyed, but the data parts held by the query still reference it via a bare `const MergeTreeData & storage`. Accessing this dangling reference causes a heap-use-after-free. The fix captures the `ContextPtr` and `MergeTreeSettingsPtr` during read pool construction (when the storage is guaranteed to be alive) and passes them via `MergeTreeReadTask::Extras` to avoid accessing `data_part->storage` during query execution. https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=022e467a014b9b3f1fc993735bc4f6b3974e59a0&name_0=MasterCI&name_1=Stress%20test%20%28arm_asan%2C%20s3%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [6cf20e8] Summary: ✅ AI ReviewSummaryThis PR fixes a heap-use-after-free in ClickHouse Rules
Final Verdict
|
…uction The `MergeTreeReadPoolBase` accesses storage through `data_part->storage` (a bare reference) during query execution. If the storage is destroyed concurrently (e.g. by `DatabaseCatalog::dropTablesParallel`), this becomes a dangling reference, causing a heap-use-after-free. The query pipeline normally holds a `StoragePtr` via `QueryPlanResourceHolder`, but certain code paths (e.g. backup restore) may not set this up correctly, allowing the storage to be freed. The fix stores a `ConstStoragePtr` (via `shared_from_this`) in `MergeTreeReadPoolBase` during construction, guaranteeing the storage outlives the read pool regardless of the pipeline's resource management. https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=022e467a014b9b3f1fc993735bc4f6b3974e59a0&name_0=MasterCI&name_1=Stress%20test%20%28arm_asan%2C%20s3%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move owned_data_storage before parts_ranges and per_part_infos to ensure it is destroyed after them. Part destructors call clearCaches() -> storage.getContext(), which requires the storage to still be alive. With the previous ordering, owned_data_storage could be destroyed before parts_ranges, causing use-after-free in part destructors. This addresses both crash patterns found in CI: - Pattern 1 (3 occurrences): LoadedMergeTreeDataPartInfoForReader constructor accessing storage.getContext() during query execution - Pattern 2 (16 occurrences): IMergeTreeDataPart::clearCaches() accessing storage.getContext() during part destruction in the read pool destructor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause analysisQuerying Pattern 1 — Pattern 2 — Destruction chain (from CI ASan traces)Both the "freed by" and the crash happen on the same thread — this is not a concurrency race, it's a sequential destruction order issue. The ASan traces from stateless tests (distributed plan) show the crash happens inside And the "freed by" trace (same thread, earlier in the chain) shows the storage was freed by a different Root causeThere are two During destruction on the same thread:
Fix in this PRThe fix stores a A deeper fix would be to always set |
Preserve when removes parts from the , and make retain the actual data storage instead of blindly retaining . This covers wrapper storages such as and prevents from observing dangling during read-pool teardown. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=022e467a014b9b3f1fc993735bc4f6b3974e59a0&name_0=MasterCI&name_1=Stress%20test%20%28arm_asan%2C%20s3%29
|
Further investigation by Codex GPT 5.4 I found the missing explanation for Mechanism B.
I pushed commit
The second point matters for wrapper storages such as I also ran direct |
The test was waiting for `MergeTreeReadersChain` to appear in the `--server_logs_file`, but `WriteBufferFromFile` uses a 1 MB buffer (`DBMS_DEFAULT_BUFFER_SIZE`) that is only flushed when the query finishes or the buffer fills up. So `grep` on the file while the query is running never sees the text, and the test times out. Replace the log-file-based synchronization with a check on `system.processes.read_rows > 0`, which reliably detects that the query has started reading (and therefore has created readers that hold data parts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| --enable_shared_storage_snapshot_in_query 0 \ | ||
| --merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability 0 \ | ||
| --query " | ||
| SELECT sleepEachRow(0.01) |
There was a problem hiding this comment.
MergeTreeReadTask::createReaders still dereferences data_part->storage.getSettings(). Please add/adjust coverage to exercise projection-index reads under teardown race as well.
Exercise `MergeTreeReadPoolProjectionIndex` during teardown to cover the `data_part->storage.getSettings()` dereference in `createReaders` on projection parts. #99483 (review) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…teardown_race` The test checked for unexpected exceptions by grepping for "Exception" in stderr, then excluding lines containing "QUERY_WAS_CANCELLED". However, the `KILL QUERY` stack trace includes lines like `DB::Exception::Exception` that do not contain the string "QUERY_WAS_CANCELLED", causing false positives. Use `grep -F "Code:"` instead, which matches actual error message lines (e.g. `Code: 394. DB::Exception: ...`) rather than stack trace class names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| @@ -0,0 +1,2 @@ | |||
| - `ReadFromMergeTree::initializePipeline` can replace `storage_snapshot->data` during `merge_tree_enable_remove_parts_from_snapshot_optimization`; when stripping `SnapshotData::parts`, preserve `SnapshotData::storage` or `IMergeTreeDataPart::clearCaches` can observe a dangling `data_part->storage`. | |||
There was a problem hiding this comment.
.claude/learnings.md from the PR before merge to avoid committing tool-specific internal notes into the repository history.
| @@ -0,0 +1,2 @@ | |||
| - `ReadFromMergeTree::initializePipeline` can replace `storage_snapshot->data` during `merge_tree_enable_remove_parts_from_snapshot_optimization`; when stripping `SnapshotData::parts`, preserve `SnapshotData::storage` or `IMergeTreeDataPart::clearCaches` can observe a dangling `data_part->storage`. | |||
| - `MergeTreeReadPoolBase` must retain the storage referenced by `MergeTreeData::SnapshotData::storage` or the parts themselves, not blindly `storage_snapshot->storage`, because wrappers such as `StorageFromMergeTreeProjection` are not the same object as `data_part->storage`. | |||
There was a problem hiding this comment.
StorageFromMergeTreeProjection has parent_storage which points to the MergeTreeData
| namespace | ||
| { | ||
|
|
||
| ConstStoragePtr getOwnedDataStorage(const StorageSnapshotPtr & storage_snapshot, const RangesInDataParts * parts_ranges = nullptr) |
There was a problem hiding this comment.
This looks hacky, are you sure that we need this, after the fix for enable_shared_storage_snapshot_in_query == false?
| @@ -0,0 +1,2 @@ | |||
| - `ReadFromMergeTree::initializePipeline` can replace `storage_snapshot->data` during `merge_tree_enable_remove_parts_from_snapshot_optimization`; when stripping `SnapshotData::parts`, preserve `SnapshotData::storage` or `IMergeTreeDataPart::clearCaches` can observe a dangling `data_part->storage`. | |||
The `ReadFromMergeTree.cpp` fix already preserves `SnapshotData::storage` when stripping the snapshot, so the storage is kept alive via `storage_snapshot` which outlives `parts_ranges` and `per_part_infos` due to member declaration order. The extra `owned_data_storage` member is redundant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Ok. Now the change is minimal. And it will be amazing if we finally nail it! |
…free-merge-tree-read-task
The test used `today()` for the non-expired row, but with randomized `session_timezone` settings (e.g. UTC-7), the Date value stored can be yesterday in UTC terms, making the TTL `age + 1 day` expire when the server evaluates it in UTC. Use a far-future date instead. https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99483&sha=bc5fcd5658a549c3eb59b2d9c27fb4b9b2fc966c&name_0=PR Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
PR changed lines: PR changed-lines coverage: 100.00% (8/8, 0 noise lines excluded) |
Backport #99483 to 26.3: Fix heap-use-after-free in MergeTreeReadTask::createReaders
Backport #99483 to 26.1: Fix heap-use-after-free in MergeTreeReadTask::createReaders
Backport #99483 to 26.2: Fix heap-use-after-free in MergeTreeReadTask::createReaders
…r-free-merge-tree-read-task Fix heap-use-after-free in MergeTreeReadTask::createReaders Signed-off-by: Ilya Golshtein <igolshtein@altinity.com>
Summary
Fix heap-use-after-free where
MergeTreeReadPoolBaseaccesses storage throughdata_part->storage(a bareconst MergeTreeData &reference) after the storage has been destroyed.Querying
checks.test_context_rawon ClickHouse Playground reveals 19 occurrences of the same root cause in the last 90 days across two distinct crash patterns:Pattern 1 —
LoadedMergeTreeDataPartInfoForReaderconstructor (3 occurrences): During query execution,createReadersconstructsLoadedMergeTreeDataPartInfoForReaderwhich callsdata_part->storage.getContext(). If the storage is already freed, this is a use-after-free.Pattern 2 —
IMergeTreeDataPart::clearCachesduring part destruction (16 occurrences): When the read pool is destroyed, data parts are released. Part destructors callremoveIfNeeded()→clearCaches()→storage.getContext()to clear mark/uncompressed caches. If the storage is freed before parts, this crashes.Root cause: The query pipeline normally holds a
StoragePtrviaQueryPlanResourceHolder, but the read pool itself does not hold a storage reference. If any code path fails to set upaddStorageHolderat the pipeline level (or if the pipeline resources are released before the pool is destroyed), the storage can be freed while parts still reference it.Fix: Store a
ConstStoragePtr(viashared_from_this) inMergeTreeReadPoolBase. It is declared as the first data member so it is destroyed last, guaranteeing the storage outlives all data parts held by the pool (parts_ranges,per_part_infos). This fixes both crash patterns.CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=022e467a014b9b3f1fc993735bc4f6b3974e59a0&name_0=MasterCI&name_1=Stress%20test%20%28arm_asan%2C%20s3%29
Test plan
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix heap-use-after-free when a table is dropped concurrently with a running read query (19 occurrences in CI over the last 90 days).
🤖 Generated with Claude Code