Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456
Open
tuanpach wants to merge 1 commit intoClickHouse:masterfrom
Open
Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456tuanpach wants to merge 1 commit intoClickHouse:masterfrom
tuanpach wants to merge 1 commit intoClickHouse:masterfrom
Conversation
Contributor
|
Workflow [PR], commit [26c8d0a] AI ReviewSummaryThis PR fixes a real correctness issue where ClickHouse Rules
Final Verdict
|
Contributor
LLVM Coverage Report
Changed lines: 100.00% (20/20) · Uncovered code |
…ough a sharded JOIN With `query_plan_read_in_order_through_join` and `query_plan_join_shard_by_pk_ranges` both enabled, `buildInputOrderInfo` traverses through the JOIN, finds the MergeTree primary key, and demotes the outer ORDER BY from `FullSortingStep` to `FinishSortingStep`, trusting the storage to emit rows in PK order end-to-end. `optimizeJoinByShards` then splits the read into independent PK-range layers. Each layer emits rows in storage order independently and outputs are concatenated without a cross-layer sort, producing wrong results (e.g. `4,3,2,1,0,9,8,7,6,5` instead of `0..9` on a `DESC`-key table with `max_threads = 2`). Three changes fix this: 1. Move `optimizeJoinByShards` to run before `optimizeReadInOrder` in `optimizeTreeSecondPass`. Previously it ran near `applyOrder` (line ~499), after `optimizeReadInOrder` had already executed inside `traverseQueryPlan` (line ~287), so the sharding state was never visible when `buildInputOrderInfo` made its decision. 2. Add a guard in all three `buildInputOrderInfo` overloads (`SortingStep`, `AggregatingStep`, `DistinctStep`): loop over `find_reading_ctx.joins_to_keep_in_order` and return early if any `JoinStep` on the path has `isJoinByLayersEnabled()`, preserving the outer ORDER BY as `FullSortingStep`. 3. Gate `enableJoinByLayers` — and the companion `convertToPartitionedFinishSorting` and `keepLeftPipelineInOrder` calls — on `all_split.layers.size() > 1`. With a single layer there is no stream interleaving and global PK ordering is preserved, so suppressing read-in-order is unnecessary and forces a full sort. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
de252c9 to
26c8d0a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
With
query_plan_read_in_order_through_joinandquery_plan_join_shard_by_pk_rangesboth enabled,
buildInputOrderInfotraverses through the JOIN, finds the MergeTreeprimary key, and demotes the outer ORDER BY from
FullSortingSteptoFinishSortingStep, trusting the storage to emit rows in PK order end-to-end.optimizeJoinByShardsthen splits the read into independent PK-range layers. Eachlayer emits rows in storage order independently and outputs are concatenated without a
cross-layer sort, producing wrong results (e.g.
4,3,2,1,0,9,8,7,6,5instead of0..9on aDESC-key table withmax_threads = 2).Three changes fix this:
optimizeJoinByShardsto run beforeoptimizeReadInOrderinoptimizeTreeSecondPass. Previously it ran nearapplyOrder(line ~499), afteroptimizeReadInOrderhad already executed insidetraverseQueryPlan(line ~287),so the sharding state was never visible when
buildInputOrderInfomade its decision.buildInputOrderInfooverloads (SortingStep,AggregatingStep,DistinctStep): if anyJoinStepon the path to the scan hasisJoinByLayersEnabled(), return early to preserve the outer ORDER BY asFullSortingStep.enableJoinByLayersonall_split.layers.size() > 1. With a single layerthere is no stream interleaving and global PK ordering is preserved, so suppressing
read-in-order is unnecessary and forces a full sort.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed wrong query results when
query_plan_read_in_order_through_joinandquery_plan_join_shard_by_pk_rangesare both enabled on a reverse-key (DESC)MergeTreetable.Closes #100870
Documentation entry for user-facing changes