Push unbounded ORDER BY to SQL for local from_df source#586
Merged
wudidapaopao merged 2 commits intoMay 27, 2026
Merged
Conversation
The cost-aware ORDER BY pushdown rule was written for remote tables, where sorting a huge table on the server before streaming can hurt. For an in-process PythonTableFunction (from_df) source there is no network/serialization cost and chDB parallel sort beats pandas single-threaded sort_values, so unbounded ORDER BY can be pushed down profitably. Plumb a `local_source` flag from core to the planner and short-circuit the no-LIMIT check when the source is a PythonTableFunction. Remote sources keep the cost-aware behavior unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alDtype pandas sorts CategoricalDtype columns by the declared category order, but chDB has no concept of categorical and would sort by string/value literal order, producing a different result. This was already wrong on the existing sort_values().head(N) lazy SQL path; the previous commit widens the trigger to bare sort_values(). Add the dtype detection at the pandas_compat.sort_values entry point so all SQL-pushdown paths see the same fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #585.
What
In
QueryPlanner._can_push_op_to_sql, the unbounded-sort guard now only applies to non-local sources. For an in-processPythonTableFunction(fromDataStore.from_df), unbounded ORDER BY is pushed to SQL.Threading:
core.pydetectslocal_source = self._source_df is not None or isinstance(self._table_function, PythonTableFunction)and passes it toplan_segments→_can_push_op_to_sql.Remote table sources, dict-source
DataStore({...}), and the existing GROUP-BY-strip / first-last / LIMIT branches are all untouched.Why
For local sources there is no transport/serialization cost and chDB parallel ORDER BY beats pandas single-threaded
sort_valueson large frames. The existing 3-segment hybrid plan (chDB filter → pandas sort) leaves performance on the table — see the table in #585.After this change,
ds[mask].sort_values([...]).to_pandas()produces a single SQL segment and matcheschdb.session.query(sql, "DataFrame")performance.Correctness
connection.query_dfalready appends_row_idas a tie-breaker (add_row_id_as_tiebreaker), so the push is stable-sort equivalent to pandas' default._row_idis unchanged.na_position != 'last'andkey=are already caught at thesort_valuesentry point inpandas_compat.pyand fall back to pandas before the planner sees them.Tests
New class
TestLocalSourceUnboundedOrderByPushedintest_orderby_cost_awareness.py:from_dfsource.ascending=[True,False], filter+sort chain.DataStore({...})still does not push (sanity check that the change is scoped).Full local test run:
11006 passed, 16 skipped, 84 xfailed, 5 xpassed.Measured impact (chdb 4.1.8 / pandas 3.0.3, 5M rows, repeated runs)
Larger gains at 10M / 20M rows (see #585 table).