Skip to content

Fix column count assertion in HashJoin::getNonJoinedBlocks#101066

Open
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix/hashjoin-column-overlap-assertion
Open

Fix column count assertion in HashJoin::getNonJoinedBlocks#101066
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix/hashjoin-column-overlap-assertion

Conversation

@groeneai
Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Logical error: Unexpected number of columns in result sample block crash in HashJoin::getNonJoinedBlocks when column names overlap between left and right sides after join side swapping (e.g., with GLOBAL NOT IN).

Description

After join side swapping (e.g., LEFT ANTI → RIGHT ANTI via GLOBAL NOT IN), column names can appear in both the left block and sample_block_with_columns_to_add. During joinBlock, AddedColumns correctly skips right-side columns whose renamed name already exists in the left block (see AddedColumns constructor, line 154 of AddedColumns.h). However, the assertion in getNonJoinedBlocks computed the expected column count as a naive sum (left + right_keys + columns_to_add) without accounting for this cross-side overlap, causing a LOGICAL_ERROR crash.

Root cause: The assertion at line 1711 of HashJoin.cpp assumed result_sample_block.columns() == left_columns_count + required_right_keys.columns() + sample_block_with_columns_to_add.columns(), but when column names overlap across sides, the result block has fewer columns because AddedColumns deduplicates them.

Fix: Count right-side columns (from both sample_block_with_columns_to_add and required_right_keys) whose renamed name appears in the left block, matching the exact deduplication logic in AddedColumns, and subtract this overlap from the expected count.

Why not PR #100617's approach: That PR uses NameSet comparison which also deduplicates within-left duplicates. Queries with subqueries can legitimately produce duplicate column names on the same side (e.g., __table2 appearing twice in the left block). The NameSet approach causes expected_names.size() < result.columns() for such queries, breaking 7 existing join tests (00577, 03707, 02497, 02267, 00217, 02381, 00370). Our cross-side overlap approach preserves within-left duplicates while correctly handling cross-side overlap.

Impact: STID 2980-385f — P0 recurring AST fuzzer crash hitting 20+ unrelated PRs and master over the last 12 days.

Verification:

  • New regression test 04068_hashjoin_column_overlap_assertion — 20/20 passes with randomization
  • All 6 locally-runnable join-duplicate tests pass 60/60 (10 runs each)
  • The 7th test (00217) requires remote shards unavailable locally but is structurally unaffected

🤖 Generated with Claude Code

@groeneai
Copy link
Copy Markdown
Contributor Author

Pre-PR Validation Gate (Session: cron:clickhouse-ci-task-worker:20260329-091500)

a) Deterministic repro? Yes — the AST fuzzer queries from PR #100617's test (GLOBAL NOT IN with join side swap) trigger the LOGICAL_ERROR assertion deterministically in debug builds without the fix.

b) Root cause explained? getNonJoinedBlocks assertion at line 1711 computes expected = left_columns + required_right_keys + columns_to_add without subtracting cross-side overlap. When AddedColumns (line 154 of AddedColumns.h) skips right-side columns whose renamed name exists in the left block, the actual result has fewer columns than the naive sum, causing expected != resultLOGICAL_ERROR.

c) Fix matches root cause? Yes — the fix mirrors the exact AddedColumns deduplication logic: builds a NameSet from left column names, counts right-side columns (from both sample_block_with_columns_to_add and required_right_keys) whose renamedRightColumnName appears in the left set, and subtracts this overlap from the expected count.

d) Test intent preserved? Yes — the assertion still checks exact column count equality. No bounds widened, no tags added, no assertions removed. The only change is correct accounting for cross-side overlap.

e) Both directions demonstrated? Test passes 20/20 with fix. All 6 locally-runnable join-duplicate tests (00577, 03707, 02497, 02267, 02381, 00370) that PR #100617 breaks pass 60/60 with this fix.

@groeneai
Copy link
Copy Markdown
Contributor Author

cc @vdimir @KochetovNicolai — could you review this? It fixes the P0 recurring STID 2980-385f crash in HashJoin::getNonJoinedBlocks by correctly accounting for cross-side column name overlap after join side swapping, using a targeted overlap count instead of the NameSet deduplication approach in #100617 (which breaks 7 join-duplicate tests).

@pufit pufit added the can be tested Allows running workflows for external contributors label Mar 29, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 29, 2026

Workflow [PR], commit [1160719]

Summary:

job_name test_name status info comment
Stress test (amd_msan) failure
Logical error: Shard number is greater than shard count: shard_num=A shard_count=B cluster=C (STID: 5066-457d) FAIL cidb

AI Review

Summary
This PR fixes a real HashJoin::getNonJoinedBlocks invariant mismatch when left/right column names overlap after join-side swapping, and adds a focused stateless regression shell test. I reviewed the modified core code paths and test updates, including interaction with AddedColumns and right-key handling, and did not find new correctness, safety, or compatibility regressions in the submitted changes.

ClickHouse Rules

Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 29, 2026
<= toInt64(materialize(9223372036854775807))
FORMAT Null
SETTINGS enable_analyzer = 1
" 2>&1 | grep -v -c 'LOGICAL_ERROR' > /dev/null
Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh bot Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not reliable for detecting LOGICAL_ERROR.

grep -v -c 'LOGICAL_ERROR' > /dev/null succeeds whenever there is any non-matching line, so a real LOGICAL_ERROR can be masked by other stderr lines (for example Received exception from server/query context lines).

Please make this assertion explicit, e.g.:

... 2>&1 | grep -q 'LOGICAL_ERROR' && exit 1
# or
... 2>&1 | (! grep -q 'LOGICAL_ERROR')

The same pattern appears again below in this file.

After join side swapping (e.g., LEFT ANTI → RIGHT ANTI via GLOBAL NOT IN),
column names can appear in both the left block and
sample_block_with_columns_to_add. AddedColumns correctly skips right-side
columns whose renamed name already exists in the left block, but the
assertion in getNonJoinedBlocks computed the expected column count as a
naive sum (left + right_keys + columns_to_add) without accounting for
this cross-side overlap.

The fix counts right-side columns whose renamed name appears in the left
block (matching the exact deduplication logic in AddedColumns) and
subtracts this overlap from the expected count. Unlike the NameSet-based
approach in PR ClickHouse#100617, this preserves within-left duplicate column names
which are legitimate in queries with subqueries.

Fixes STID 2980-385f — P0 recurring AST fuzzer crash hitting 20+ PRs
and master over the last 12 days.
- Replace NameSet::count() with NameSet::contains() for membership checks
  (clang-tidy readability-container-contains diagnostic)
- Fix test assertion: use `{ ! grep -q 'LOGICAL_ERROR'; }` instead of
  `grep -v -c 'LOGICAL_ERROR' > /dev/null` which could mask LOGICAL_ERROR
  presence when other stderr lines exist
@groeneai groeneai force-pushed the fix/hashjoin-column-overlap-assertion branch from 5b6ba62 to 1160719 Compare March 29, 2026 17:22
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 29, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.20% +0.00%
Functions 24.60% 24.60% +0.00%
Branches 76.70% 76.70% +0.00%

Changed lines: 90.00% (27/30) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants