Skip to content

Fix Block structure mismatch with normal projections and parallel replicas#99497

Open
alexey-milovidov wants to merge 16 commits intomasterfrom
fix-projection-parallel-replicas-mismatch
Open

Fix Block structure mismatch with normal projections and parallel replicas#99497
alexey-milovidov wants to merge 16 commits intomasterfrom
fix-projection-parallel-replicas-mismatch

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

When optimizeUseNormalProjections replaces the read chain with a projection-based read where all parts come from the projection, the new chain may produce extra columns (e.g. from PREWHERE column restoration in splitAndFillPrewhereInfo). With parallel replicas, the local plan uses the projection while the remote plan does not, so the UnionStep combining them gets mismatched headers — the local side has extra columns the remote side doesn't.

The fix adds a converting step when the projection chain's output differs from the original chain's output. This mirrors the existing compatibility check already present for the mixed projection+parent-parts code path (line 459).

Closes #96131. Closes #95191.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix exception "Block structure mismatch in UnionStep stream" when using normal projections with parallel replicas.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

… parallel replicas

When `optimizeUseNormalProjections` replaces the read chain with a
projection-based read, the new chain may output extra columns (e.g.,
`region` from PREWHERE restoration) that the original chain did not have.
This causes a "Block structure mismatch in UnionStep stream" exception
in debug/sanitizer builds when the local plan (with projection) is
combined with the remote plan (without projection) in parallel replicas.

The fix adds a converting step when the projection chain's output header
differs from the expected output, ensuring headers stay consistent.
This mirrors the existing header compatibility check already present
for the mixed projection+parent-parts case (lines 459-460).

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=73f45b0717e3a4763b7d1cde71950a076b5b00c7&name_0=MasterCI&name_1=Stress%20test%20%28azure%2C%20amd_msan%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 14, 2026

Workflow [PR], commit [fb29112]

Summary:

job_name test_name status info comment
Fast test failure
04039_normal_projection_parallel_replicas FAIL cidb
clickhouse-test FAIL cidb
Build (amd_debug) dropped
Build (amd_asan_ubsan) dropped
Build (amd_tsan) dropped
Build (amd_msan) dropped
Build (amd_binary) dropped
Build (arm_debug) dropped
Build (arm_asan_ubsan) dropped
Build (arm_tsan) dropped
Build (arm_msan) dropped

AI Review

Summary

This PR fixes a UnionStep header mismatch for normal projections with parallel replicas by inserting a projection-output conversion step when the projection-only branch header differs from the expected read header. The change is surgical, consistent with existing mixed-branch compatibility handling, and covered by a dedicated stateless regression test that also checks for the optimizer-introduced conversion step in EXPLAIN PLAN.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 14, 2026
@devcrafter devcrafter self-assigned this Mar 15, 2026
@devcrafter
Copy link
Copy Markdown
Member

devcrafter commented Mar 17, 2026

Here is the reason we can have unused columns after PREWHERE in header -

/// This is the leak of abstraction.
/// Split actions may have inputs which are needed only for PREWHERE.
/// This is fine for ActionsDAG to have such a split, but it breaks defaults calculation.
///
/// See 00950_default_prewhere for example.
/// Table has structure `APIKey UInt8, SessionType UInt8` and default `OperatingSystem = SessionType+1`
/// For a query with `SELECT OperatingSystem WHERE APIKey = 42 AND SessionType = 42` we push everything to PREWHERE
/// and columns APIKey, SessionType are removed from inputs (cause only OperatingSystem is needed).
/// However, column OperatingSystem is calculated after PREWHERE stage, based on SessionType value.
/// If column SessionType is removed by PREWHERE actions, we use zero as default, and get a wrong result.
///
/// So, here we restore removed inputs for PREWHERE actions
{
std::unordered_set<const ActionsDAG::Node *> first_outputs(
split_result.first.getOutputs().begin(), split_result.first.getOutputs().end());
for (const auto * input : split_result.first.getInputs())
{
if (!first_outputs.contains(input))
{
split_result.first.getOutputs().push_back(input);
/// Add column to second actions as input.
/// Do not add it to result, so it would be removed.
split_result.second.addInput(input->result_name, input->result_type);
}
}
}

It fixes another issue. So, it seems it's a job for query_plan_remove_unused_columns optimization, but currently it's not happening

Copy link
Copy Markdown
Member

@devcrafter devcrafter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more symptom fix. Proper fix will take some effort - some thoughts here

alexey-milovidov and others added 10 commits March 24, 2026 00:32
The previous test had too few rows (2) to trigger `optimizePrewhere`,
which is needed to produce the extra PREWHERE restoration columns that
cause the Block structure mismatch. With 10000 rows, column sizes are
meaningful and the `MergeTreeWhereOptimizer` moves the filter to
PREWHERE. The PREWHERE column restoration then adds `region` as an
extra output that persists through `ActionsDAG::mergeInplace` into the
projection query DAG, causing the header mismatch with the remote
replica plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bugfix validation was failing because commit 0f305e6 (already
in master) moved the debug assertion in `UnionStep::updatePipeline`
after the header conversion code. This means the pipeline-level safety
net silently handles mismatched headers even in debug builds, so the
test could never reproduce the runtime error against the master binary.

Add an EXPLAIN PLAN check that detects the plan-level fix: the
converting `ExpressionStep` added by `optimizeUseNormalProjection`
creates two consecutive Expression steps in the plan (converting +
query processing). Without the fix, only the query processing
Expression exists. This structural difference is observable regardless
of build type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `groupArray` + `arrayExists` with a lambda to detect consecutive
Expression steps, instead of a self-join on two separate EXPLAIN
queries. This is both simpler and avoids potential non-determinism
from running two independent EXPLAIN executions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The EXPLAIN-based plan check for consecutive Expression steps was
unreliable: it returned the same result on both the master binary
(without the plan-level fix) and the patched binary because other
plan optimizations also create consecutive Expression steps.

The actual runtime fix for the `Block structure mismatch in UnionStep`
exception was merged via PR #99515, which moved the debug assertion
in `UnionStep::updatePipeline` after the header conversion code.
This PR's plan-level fix in `optimizeUseNormalProjections` is a
defense-in-depth measure that prevents the header mismatch from
reaching the pipeline build stage.

Keep only the query correctness check as a regression test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `optimizeLazyMaterialization2` optimization adds its own projection
step that strips extra PREWHERE columns, masking the header mismatch bug
in `optimizeUseNormalProjections`. Disable it so the bugfix validation
can distinguish between the fixed and unfixed code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-- Since `region` is NOT in SELECT, optimizePrewhere restores it as an
-- extra column that persists in the projection query DAG, causing a
-- header mismatch with the remote plan in the parallel replicas UnionStep.
SELECT url FROM test_projection_pr WHERE region = 'europe' ORDER BY url LIMIT 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This regression check currently validates only query success, but after #99515 moved the UnionStep header conversion into the pipeline path, this query can succeed even without the optimizeUseNormalProjections conversion added in this PR. That means this test can pass while the PR-specific fix regresses.

Please add an assertion that is specific to this optimization path (for example, a deterministic plan/pipeline shape check tied to the conversion step inserted here), so the test fails on code without this PR change and passes with it.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 30, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.70% 76.70% +0.00%

Changed lines: 100.00% (18/18) · Uncovered code

Full report · Diff report

alexey-milovidov and others added 3 commits March 30, 2026 16:38
The bugfix validation test was failing because the query-success check
alone cannot distinguish between master and this PR — PR #99515 already
handles the header mismatch at pipeline level, so the query succeeds on
both.

Add a `setStepDescription` call to the converting `ExpressionStep` in
`optimizeUseNormalProjections` so it appears as "Convert projection
output to match expected header" in EXPLAIN output. The test now checks
for this specific step description, which is absent on master and
present with this fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

2 participants