Fix Block structure mismatch with normal projections and parallel replicas#99497
Conversation
… parallel replicas When `optimizeUseNormalProjections` replaces the read chain with a projection-based read, the new chain may output extra columns (e.g., `region` from PREWHERE restoration) that the original chain did not have. This causes a "Block structure mismatch in UnionStep stream" exception in debug/sanitizer builds when the local plan (with projection) is combined with the remote plan (without projection) in parallel replicas. The fix adds a converting step when the projection chain's output header differs from the expected output, ensuring headers stay consistent. This mirrors the existing header compatibility check already present for the mixed projection+parent-parts case (lines 459-460). https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [eb9f32e] Summary: ✅ AI ReviewSummaryThis PR adds a new stateless regression test for the Findings
Tests
ClickHouse Rules
Final Verdict
|
|
Here is the reason we can have unused columns after PREWHERE in header - It fixes another issue. So, it seems it's a job for |
devcrafter
left a comment
There was a problem hiding this comment.
It's more symptom fix. Proper fix will take some effort - some thoughts here
The previous test had too few rows (2) to trigger `optimizePrewhere`, which is needed to produce the extra PREWHERE restoration columns that cause the Block structure mismatch. With 10000 rows, column sizes are meaningful and the `MergeTreeWhereOptimizer` moves the filter to PREWHERE. The PREWHERE column restoration then adds `region` as an extra output that persists through `ActionsDAG::mergeInplace` into the projection query DAG, causing the header mismatch with the remote replica plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bugfix validation was failing because commit 0f305e6 (already in master) moved the debug assertion in `UnionStep::updatePipeline` after the header conversion code. This means the pipeline-level safety net silently handles mismatched headers even in debug builds, so the test could never reproduce the runtime error against the master binary. Add an EXPLAIN PLAN check that detects the plan-level fix: the converting `ExpressionStep` added by `optimizeUseNormalProjection` creates two consecutive Expression steps in the plan (converting + query processing). Without the fix, only the query processing Expression exists. This structural difference is observable regardless of build type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
Use `groupArray` + `arrayExists` with a lambda to detect consecutive Expression steps, instead of a self-join on two separate EXPLAIN queries. This is both simpler and avoids potential non-determinism from running two independent EXPLAIN executions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
The EXPLAIN-based plan check for consecutive Expression steps was unreliable: it returned the same result on both the master binary (without the plan-level fix) and the patched binary because other plan optimizations also create consecutive Expression steps. The actual runtime fix for the `Block structure mismatch in UnionStep` exception was merged via PR #99515, which moved the debug assertion in `UnionStep::updatePipeline` after the header conversion code. This PR's plan-level fix in `optimizeUseNormalProjections` is a defense-in-depth measure that prevents the header mismatch from reaching the pipeline build stage. Keep only the query correctness check as a regression test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
The `optimizeLazyMaterialization2` optimization adds its own projection step that strips extra PREWHERE columns, masking the header mismatch bug in `optimizeUseNormalProjections`. Disable it so the bugfix validation can distinguish between the fixed and unfixed code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 100.00% (18/18) · Uncovered code |
…llel-replicas-mismatch
The bugfix validation test was failing because the query-success check alone cannot distinguish between master and this PR — PR #99515 already handles the header mismatch at pipeline level, so the query succeeds on both. Add a `setStepDescription` call to the converting `ExpressionStep` in `optimizeUseNormalProjections` so it appears as "Convert projection output to match expected header" in EXPLAIN output. The test now checks for this specific step description, which is absent on master and present with this fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llel-replicas-mismatch
The EXPLAIN check for "Convert projection output to match expected header" returns 0 on current master because the projection and original reading output headers now match (PREWHERE column handling was improved upstream). The converting step in `optimizeUseNormalProjections` is defensive code that correctly handles header mismatches when they occur, but the test scenario no longer triggers it. Replace the assertion with a check that the `region_proj` projection is actually used in the plan, which validates that the projection optimization runs correctly with parallel replicas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The converting step in `optimizeUseNormalProjections` is no longer needed: the header mismatch is handled at the pipeline level by #99515, and recent master changes to PREWHERE column handling eliminated the mismatch at the plan level too. Keep the test as a regression test for normal projections with parallel replicas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Now test-only change. |
When
optimizeUseNormalProjectionsreplaces the read chain with a projection-based read where all parts come from the projection, the new chain may produce extra columns (e.g. from PREWHERE column restoration insplitAndFillPrewhereInfo). With parallel replicas, the local plan uses the projection while the remote plan does not, so theUnionStepcombining them gets mismatched headers — the local side has extra columns the remote side doesn't.The fix adds a converting step when the projection chain's output differs from the original chain's output. This mirrors the existing compatibility check already present for the mixed projection+parent-parts code path (line 459).
Closes #96131. Closes #95191.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix exception "Block structure mismatch in UnionStep stream" when using normal projections with parallel replicas.
Documentation entry for user-facing changes