Fix Block structure mismatch with normal projections and parallel replicas#99497
Fix Block structure mismatch with normal projections and parallel replicas#99497alexey-milovidov wants to merge 16 commits intomasterfrom
Conversation
… parallel replicas When `optimizeUseNormalProjections` replaces the read chain with a projection-based read, the new chain may output extra columns (e.g., `region` from PREWHERE restoration) that the original chain did not have. This causes a "Block structure mismatch in UnionStep stream" exception in debug/sanitizer builds when the local plan (with projection) is combined with the remote plan (without projection) in parallel replicas. The fix adds a converting step when the projection chain's output header differs from the expected output, ensuring headers stay consistent. This mirrors the existing header compatibility check already present for the mixed projection+parent-parts case (lines 459-460). https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [fb29112] Summary: ❌
AI ReviewSummaryThis PR fixes a ClickHouse Rules
Final Verdict
|
|
Here is the reason we can have unused columns after PREWHERE in header - It fixes another issue. So, it seems it's a job for |
devcrafter
left a comment
There was a problem hiding this comment.
It's more symptom fix. Proper fix will take some effort - some thoughts here
The previous test had too few rows (2) to trigger `optimizePrewhere`, which is needed to produce the extra PREWHERE restoration columns that cause the Block structure mismatch. With 10000 rows, column sizes are meaningful and the `MergeTreeWhereOptimizer` moves the filter to PREWHERE. The PREWHERE column restoration then adds `region` as an extra output that persists through `ActionsDAG::mergeInplace` into the projection query DAG, causing the header mismatch with the remote replica plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bugfix validation was failing because commit 0f305e6 (already in master) moved the debug assertion in `UnionStep::updatePipeline` after the header conversion code. This means the pipeline-level safety net silently handles mismatched headers even in debug builds, so the test could never reproduce the runtime error against the master binary. Add an EXPLAIN PLAN check that detects the plan-level fix: the converting `ExpressionStep` added by `optimizeUseNormalProjection` creates two consecutive Expression steps in the plan (converting + query processing). Without the fix, only the query processing Expression exists. This structural difference is observable regardless of build type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
Use `groupArray` + `arrayExists` with a lambda to detect consecutive Expression steps, instead of a self-join on two separate EXPLAIN queries. This is both simpler and avoids potential non-determinism from running two independent EXPLAIN executions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
The EXPLAIN-based plan check for consecutive Expression steps was unreliable: it returned the same result on both the master binary (without the plan-level fix) and the patched binary because other plan optimizations also create consecutive Expression steps. The actual runtime fix for the `Block structure mismatch in UnionStep` exception was merged via PR #99515, which moved the debug assertion in `UnionStep::updatePipeline` after the header conversion code. This PR's plan-level fix in `optimizeUseNormalProjections` is a defense-in-depth measure that prevents the header mismatch from reaching the pipeline build stage. Keep only the query correctness check as a regression test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
The `optimizeLazyMaterialization2` optimization adds its own projection step that strips extra PREWHERE columns, masking the header mismatch bug in `optimizeUseNormalProjections`. Disable it so the bugfix validation can distinguish between the fixed and unfixed code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| -- Since `region` is NOT in SELECT, optimizePrewhere restores it as an | ||
| -- extra column that persists in the projection query DAG, causing a | ||
| -- header mismatch with the remote plan in the parallel replicas UnionStep. | ||
| SELECT url FROM test_projection_pr WHERE region = 'europe' ORDER BY url LIMIT 1; |
There was a problem hiding this comment.
#99515 moved the UnionStep header conversion into the pipeline path, this query can succeed even without the optimizeUseNormalProjections conversion added in this PR. That means this test can pass while the PR-specific fix regresses.
Please add an assertion that is specific to this optimization path (for example, a deterministic plan/pipeline shape check tied to the conversion step inserted here), so the test fails on code without this PR change and passes with it.
LLVM Coverage Report
Changed lines: 100.00% (18/18) · Uncovered code |
…llel-replicas-mismatch
The bugfix validation test was failing because the query-success check alone cannot distinguish between master and this PR — PR #99515 already handles the header mismatch at pipeline level, so the query succeeds on both. Add a `setStepDescription` call to the converting `ExpressionStep` in `optimizeUseNormalProjections` so it appears as "Convert projection output to match expected header" in EXPLAIN output. The test now checks for this specific step description, which is absent on master and present with this fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When
optimizeUseNormalProjectionsreplaces the read chain with a projection-based read where all parts come from the projection, the new chain may produce extra columns (e.g. from PREWHERE column restoration insplitAndFillPrewhereInfo). With parallel replicas, the local plan uses the projection while the remote plan does not, so theUnionStepcombining them gets mismatched headers — the local side has extra columns the remote side doesn't.The fix adds a converting step when the projection chain's output differs from the original chain's output. This mirrors the existing compatibility check already present for the mixed projection+parent-parts code path (line 459).
Closes #96131. Closes #95191.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix exception "Block structure mismatch in UnionStep stream" when using normal projections with parallel replicas.
Documentation entry for user-facing changes