Skip to content

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456

Open
tuanpach wants to merge 1 commit intoClickHouse:masterfrom
tuanpach:fix-join-shard-read-in-order-reverse-key
Open

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456
tuanpach wants to merge 1 commit intoClickHouse:masterfrom
tuanpach:fix-join-shard-read-in-order-reverse-key

Conversation

@tuanpach
Copy link
Copy Markdown
Member

@tuanpach tuanpach commented Apr 1, 2026

With query_plan_read_in_order_through_join and query_plan_join_shard_by_pk_ranges
both enabled, buildInputOrderInfo traverses through the JOIN, finds the MergeTree
primary key, and demotes the outer ORDER BY from FullSortingStep to
FinishSortingStep, trusting the storage to emit rows in PK order end-to-end.
optimizeJoinByShards then splits the read into independent PK-range layers. Each
layer emits rows in storage order independently and outputs are concatenated without a
cross-layer sort, producing wrong results (e.g. 4,3,2,1,0,9,8,7,6,5 instead of
0..9 on a DESC-key table with max_threads = 2).

Three changes fix this:

  • Move optimizeJoinByShards to run before optimizeReadInOrder in
    optimizeTreeSecondPass. Previously it ran near applyOrder (line ~499), after
    optimizeReadInOrder had already executed inside traverseQueryPlan (line ~287),
    so the sharding state was never visible when buildInputOrderInfo made its decision.
  • Add a guard in all three buildInputOrderInfo overloads (SortingStep,
    AggregatingStep, DistinctStep): if any JoinStep on the path to the scan has
    isJoinByLayersEnabled(), return early to preserve the outer ORDER BY as
    FullSortingStep.
  • Gate enableJoinByLayers on all_split.layers.size() > 1. With a single layer
    there is no stream interleaving and global PK ordering is preserved, so suppressing
    read-in-order is unnecessary and forces a full sort.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed wrong query results when query_plan_read_in_order_through_join and
query_plan_join_shard_by_pk_ranges are both enabled on a reverse-key (DESC)
MergeTree table.

Closes #100870

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2026

Workflow [PR], commit [26c8d0a]


AI Review

Summary

This PR fixes a real correctness issue where read-in-order-through-join could incorrectly elide the final sort after join-by-pk-ranges sharding, yielding misordered results on reverse-key MergeTree. The fix is coherent: it reorders optimization passes, guards all relevant buildInputOrderInfo paths (SortingStep, AggregatingStep, DistinctStep), and only marks joins as layered when multiple layers actually exist. The regression test covers the failing path and the additional guarded paths. High-level verdict: ✅ good and ready.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-ci label Apr 1, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 100.00% (20/20) · Uncovered code

Full report · Diff report

@antaljanosbenjamin antaljanosbenjamin mentioned this pull request Apr 1, 2026
1 task
…ough a sharded JOIN

With `query_plan_read_in_order_through_join` and `query_plan_join_shard_by_pk_ranges`
both enabled, `buildInputOrderInfo` traverses through the JOIN, finds the MergeTree
primary key, and demotes the outer ORDER BY from `FullSortingStep` to
`FinishSortingStep`, trusting the storage to emit rows in PK order end-to-end.
`optimizeJoinByShards` then splits the read into independent PK-range layers. Each
layer emits rows in storage order independently and outputs are concatenated without a
cross-layer sort, producing wrong results (e.g. `4,3,2,1,0,9,8,7,6,5` instead of
`0..9` on a `DESC`-key table with `max_threads = 2`).

Three changes fix this:

1. Move `optimizeJoinByShards` to run before `optimizeReadInOrder` in
   `optimizeTreeSecondPass`. Previously it ran near `applyOrder` (line ~499), after
   `optimizeReadInOrder` had already executed inside `traverseQueryPlan` (line ~287),
   so the sharding state was never visible when `buildInputOrderInfo` made its decision.

2. Add a guard in all three `buildInputOrderInfo` overloads (`SortingStep`,
   `AggregatingStep`, `DistinctStep`): loop over `find_reading_ctx.joins_to_keep_in_order`
   and return early if any `JoinStep` on the path has `isJoinByLayersEnabled()`, preserving
   the outer ORDER BY as `FullSortingStep`.

3. Gate `enableJoinByLayers` — and the companion `convertToPartitionedFinishSorting` and
   `keepLeftPipelineInOrder` calls — on `all_split.layers.size() > 1`. With a single layer
   there is no stream interleaving and global PK ordering is preserved, so suppressing
   read-in-order is unnecessary and forces a full sort.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tuanpach tuanpach force-pushed the fix-join-shard-read-in-order-reverse-key branch from de252c9 to 26c8d0a Compare April 2, 2026 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: 03668_shard_join_in_reverse_order

1 participant