Skip to content

Fix column count mismatch in HashJoin::getNonJoinedBlocks after join side swap#100617

Open
alexey-milovidov wants to merge 4 commits intomasterfrom
fix-join-column-overlap-getNonJoinedBlocks
Open

Fix column count mismatch in HashJoin::getNonJoinedBlocks after join side swap#100617
alexey-milovidov wants to merge 4 commits intomasterfrom
fix-join-column-overlap-getNonJoinedBlocks

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

After join side swapping (e.g., LEFT ANTI → RIGHT ANTI for GLOBAL NOT IN), a column name can appear in both the left block and sample_block_with_columns_to_add or required_right_keys. There can also be duplicates within the left block itself. The result_sample_block deduplicates such columns, so the previous check that compared raw column counts was too strict and triggered a LOGICAL_ERROR exception.

Replace the raw-sum check with a unique-name-set comparison that correctly handles all deduplication cases.

Closes #100422

CI reports:

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix LOGICAL_ERROR exception "Unexpected number of columns in result sample block" in HashJoin::getNonJoinedBlocks that could occur with GLOBAL IN / GLOBAL NOT IN queries when the join optimizer swaps sides.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

…n side swap

After join side swapping (e.g., LEFT ANTI → RIGHT ANTI for GLOBAL NOT IN),
a column name can appear in both the left block and
`sample_block_with_columns_to_add` or `required_right_keys`. There can also
be duplicates within the left block itself. The `result_sample_block`
deduplicates such columns, so the previous check that compared raw column
counts (left + right_keys + right_add) against `result_sample_block.columns()`
was too strict and triggered a LOGICAL_ERROR exception.

Replace the raw-sum check with a unique-name-set comparison that correctly
handles all deduplication cases.

Closes #100422

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100378&sha=ce356689f6a6d126e078f5cbfcbab8b3849de673&name_0=PR&name_1=AST%20fuzzer%20%28amd_tsan%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 24, 2026

Workflow [PR], commit [79cad63]

Summary:

job_name test_name status info comment
Fast test failure
Server died FAIL cidb
clickhouse-test FAIL cidb
Build (amd_debug) dropped
Build (amd_asan_ubsan) dropped
Build (amd_tsan) dropped
Build (amd_msan) dropped
Build (amd_binary) dropped
Build (arm_debug) dropped
Build (arm_asan_ubsan) dropped
Build (arm_tsan) dropped
Build (arm_msan) dropped

AI Review

Summary

This PR fixes a real LOGICAL_ERROR path in HashJoin::getNonJoinedBlocks by replacing a raw column-count invariant with a unique-name-set comparison, which matches actual deduplication behavior after join-side swap. The core logic change looks correct and targeted, and the PR includes a dedicated regression test. I found one major issue in test robustness (already noted by an existing inline comment) and one additional major issue in exception-message quality.

Findings

⚠️ Majors

  • [src/Interpreters/HashJoin/HashJoin.cpp:1743] The invariant now validates set equality, but the thrown error still says Unexpected number of columns and reports an additive decomposition ([left] + [right] + [to_add]) that is not semantically correct after deduplication. This makes debugging future mismatches unnecessarily confusing.
    • Suggested fix: reword the message to describe column-set mismatch and include expected/actual names (or missing/extra names) instead of count arithmetic.
Tests
  • ⚠️ tests/queries/0_stateless/04057_global_not_in_join_swap_column_overlap.sh currently suppresses all failures for both repro queries via >/dev/null 2>&1 ||:. This means the test can pass even if the original LOGICAL_ERROR regresses. Add an assertion that the specific error text is absent (or fail on non-zero exit except for explicitly allowed errors).
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Make the new regression test assertive (do not ignore all failures with ||:).
    • Update the HashJoin::getNonJoinedBlocks exception text to match set-based validation semantics.

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 24, 2026
@groeneai
Copy link
Copy Markdown
Contributor

Hi @alexey-milovidov, the fix direction is correct but the NameSet approach over-deduplicates. Here's why the 8 tests fail and a suggested correction.

Why the 8 Tests Fail

The 8 failing tests (00370_duplicate_columns_in_subqueries, 02497_storage_join_right_assert, etc.) contain queries with within-side duplicate column names — e.g. the same subquery column appearing twice on the left side:

SELECT 'y', * FROM (SELECT count('y'), count('y'), 2 AS x) AS t1 
RIGHT JOIN (SELECT count('x'), count('y'), 3 AS x) AS t2 ON t1.x = t2.x;

Here count('y') appears twice in t1 (left side). The result_sample_block legitimately has both columns (count: ≥5). But NameSet deduplicates count('y') to a single entry → expected_names.size() < result_sample_block.columns() → assertion fires incorrectly.

The Root Difference

  • STID 2980-385f case: expected > actual — same name in left AND right → result deduplicates cross-side → expected overcounts
  • Failing tests: expected < actual — same name twice within left → NameSet undercounts

Suggested Fix: Cross-Side Overlap Only

Instead of using NameSet for everything, subtract only the cross-side overlap (names that appear in both the left side and the right side / keys):

size_t left_columns_count = left_sample_block.columns();
if (canRemoveColumnsFromLeftBlock())
    left_columns_count = table_join->getOutputColumns(JoinTableSide::Left).size();

// Build unique name set for left side only
NameSet left_names;
if (canRemoveColumnsFromLeftBlock())
    for (const auto & col : table_join->getOutputColumns(JoinTableSide::Left))
        left_names.insert(col.name);
else
    for (const auto & name : left_sample_block.getNames())
        left_names.insert(name);

// Count names that appear in BOTH left and (right_keys ∪ to_add)
size_t cross_side_overlap = 0;
for (size_t i = 0; i < required_right_keys.columns(); ++i)
    if (left_names.count(required_right_keys.getByPosition(i).name))
        ++cross_side_overlap;
for (size_t i = 0; i < sample_block_with_columns_to_add.columns(); ++i)
    if (left_names.count(sample_block_with_columns_to_add.getByPosition(i).name))
        ++cross_side_overlap;

size_t expected_columns_count = left_columns_count
    + required_right_keys.columns()
    + sample_block_with_columns_to_add.columns()
    - cross_side_overlap;

This:

  • Preserves within-left duplicates in the count → no false positive for {count('y'), count('y'), x}
  • Correctly subtracts when the same name appears in both left and right → fixes STID 2980-385f
  • Minimal change from the original logic — just adds the overlap subtraction

alexey-milovidov and others added 2 commits March 28, 2026 16:40
The previous fix compared `expected_names.size()` (unique) against
`result_sample_block.columns()` (raw count including duplicates).
When duplicate column names exist in the result block (e.g., after
join side swap), `result_sample_block` also has duplicates, so the
comparison was still incorrect.

Compare unique name sets on both sides instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
<= toInt64(materialize(9223372036854775807))
FORMAT Null
SETTINGS enable_analyzer = 1
" >/dev/null 2>&1 ||:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression check is currently ineffective: both repro queries end with >/dev/null 2>&1 ||:, so the test passes even when HashJoin::getNonJoinedBlocks throws the same LOGICAL_ERROR this PR is fixing.

Could we assert that this specific error is absent instead of ignoring all failures? For example, capture stderr/stdout and fail if it contains Unexpected number of columns in result sample block / LOGICAL_ERROR.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 29, 2026
After join side swapping (e.g., LEFT ANTI → RIGHT ANTI via GLOBAL NOT IN),
column names can appear in both the left block and
sample_block_with_columns_to_add. AddedColumns correctly skips right-side
columns whose renamed name already exists in the left block, but the
assertion in getNonJoinedBlocks computed the expected column count as a
naive sum (left + right_keys + columns_to_add) without accounting for
this cross-side overlap.

The fix counts right-side columns whose renamed name appears in the left
block (matching the exact deduplication logic in AddedColumns) and
subtracts this overlap from the expected count. Unlike the NameSet-based
approach in PR ClickHouse#100617, this preserves within-left duplicate column names
which are legitimate in queries with subqueries.

Fixes STID 2980-385f — P0 recurring AST fuzzer crash hitting 20+ PRs
and master over the last 12 days.
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Mar 29, 2026
After join side swapping (e.g., LEFT ANTI → RIGHT ANTI via GLOBAL NOT IN),
column names can appear in both the left block and
sample_block_with_columns_to_add. AddedColumns correctly skips right-side
columns whose renamed name already exists in the left block, but the
assertion in getNonJoinedBlocks computed the expected column count as a
naive sum (left + right_keys + columns_to_add) without accounting for
this cross-side overlap.

The fix counts right-side columns whose renamed name appears in the left
block (matching the exact deduplication logic in AddedColumns) and
subtracts this overlap from the expected count. Unlike the NameSet-based
approach in PR ClickHouse#100617, this preserves within-left duplicate column names
which are legitimate in queries with subqueries.

Fixes STID 2980-385f — P0 recurring AST fuzzer crash hitting 20+ PRs
and master over the last 12 days.
@@ -1721,7 +1741,7 @@ HashJoin::getNonJoinedBlocks(const Block & left_sample_block, const Block & resu

throw Exception(ErrorCodes::LOGICAL_ERROR,
"Unexpected number of columns in result sample block: {} expected {} ([{}] = [{}] + [{}] + [{}])",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check above now compares name sets, not raw column counts, but this exception still says Unexpected number of columns and prints a pseudo-sum ([left] + [right_keys] + [columns_to_add]) that is no longer mathematically meaningful after deduplication.

Could we reword it to reflect set mismatch (for example, Unexpected columns in result sample block) and print the expected name set directly? That will make future debugging much less confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: Unexpected number of columns in result sample block: A expected B ([C] = [D] + [E] + [F]) (STID: 2980-385f)

2 participants