Push unbounded ORDER BY to SQL for local from_df source by wudidapaopao · Pull Request #586 · chdb-io/chdb

wudidapaopao · 2026-05-27T12:31:27Z

Fixes #585.

What

In QueryPlanner._can_push_op_to_sql, the unbounded-sort guard now only applies to non-local sources. For an in-process PythonTableFunction (from DataStore.from_df), unbounded ORDER BY is pushed to SQL.

Plumbing: core.py computes local_source = self._source_df is not None or isinstance(self._table_function, PythonTableFunction) and passes it through plan_segments to _can_push_op_to_sql.

Remote table sources, dict-source DataStore({...}), and the existing GROUP-BY-strip / first-last / LIMIT branches are all untouched.
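The guard described above can be sketched roughly as follows. This is an illustration only, not the chDB source: SortOp and can_push_sort_to_sql are hypothetical names, and the sketch assumes that a sort followed by a LIMIT was already pushable before this change (the "LIMIT branches" mentioned above).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SortOp:
    """Hypothetical stand-in for a lazy sort_values() operation."""
    by: List[str]
    limit: Optional[int] = None  # None means the sort is unbounded (no LIMIT follows)


def can_push_sort_to_sql(op: SortOp, local_source: bool) -> bool:
    """Illustrative version of the unbounded-sort guard (assumed shape)."""
    if op.limit is not None:
        # Bounded sort (ORDER BY ... LIMIT N): assumed pushable regardless of source.
        return True
    # Unbounded sort: pushed only when the source lives in-process.
    return local_source


unbounded = SortOp(by=["a", "b"])
print(can_push_sort_to_sql(unbounded, local_source=True))
print(can_push_sort_to_sql(unbounded, local_source=False))
```

In this sketch the only behavioral difference between the two calls is the local_source flag, which mirrors how the change is scoped to from_df sources.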

Why

For local sources there is no transport or serialization cost, and chDB's parallel ORDER BY beats pandas' single-threaded sort_values on large frames. The existing 3-segment hybrid plan (chDB filter → pandas sort) leaves performance on the table; see the table in #585.

After this change, ds[mask].sort_values([...]).to_pandas() produces a single SQL segment and matches chdb.session.query(sql, "DataFrame") performance.

Correctness

connection.query_df already appends _row_id as a tie-breaker (add_row_id_as_tiebreaker), so the push is stable-sort equivalent to pandas' default.
pandas index restoration via _row_id is unchanged.
na_position != 'last' and key= are already caught at the sort_values entry point in pandas_compat.py and fall back to pandas before the planner sees them.
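Why the _row_id tie-breaker yields a stable-sort result can be shown with a small pure-Python analogy (no chDB or pandas involved; the function names are made up for illustration). Sorting on (key, _row_id) defines a total order, so the output no longer depends on the order in which a parallel engine happens to see equal keys.

```python
import random

# Each row is (_row_id, key); _row_id is the original position in the frame.
rows = [(0, "b"), (1, "a"), (2, "b"), (3, "a"), (4, "b")]


def stable_sort(rows):
    """Reference result: Python's sorted() is stable, so ties keep input order."""
    return sorted(rows, key=lambda r: r[1])


def sort_with_row_id_tiebreaker(rows, seed=0):
    """Simulate an engine with no tie-order guarantee by shuffling first,
    then sort on (key, _row_id) so ties are resolved by original position."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled, key=lambda r: (r[1], r[0]))


print(stable_sort(rows) == sort_with_row_id_tiebreaker(rows, seed=7))
```

The shuffle stands in for nondeterministic processing order; the comparison holds for any seed because the composite key never produces ties.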

Tests

New class TestLocalSourceUnboundedOrderByPushed in test_orderby_cost_awareness.py:

ORDER BY is present in the SQL segment for from_df source.
Result matches pandas for: ascending/descending, multi-col, mixed ascending=[True,False], filter+sort chain.
Pre-GROUP BY sort is still stripped (existing behavior, sanity check).
Dict-source DataStore({...}) still does not push (sanity check that the change is scoped).

Full local test run: 11006 passed, 16 skipped, 84 xfailed, 5 xpassed.

Measured impact (chdb 4.1.8 / pandas 3.0.3, 5M rows, repeated runs)

	before	after
DataStore filter+sort	~1930 ms	~1530 ms (matches direct chDB SQL)

Larger gains at 10M / 20M rows (see #585 table).
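For reference, the 5M-row figures above work out as follows. The inputs are the approximate timings from the table, so the derived numbers are equally approximate.

```python
before_ms = 1930  # DataStore filter+sort before the change (approximate)
after_ms = 1530   # after the change (approximate)

speedup = before_ms / after_ms
saved_pct = 100 * (before_ms - after_ms) / before_ms

print(f"{speedup:.2f}x faster, {saved_pct:.1f}% less wall time")
# prints "1.26x faster, 20.7% less wall time"
```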

The cost-aware ORDER BY pushdown rule was written for remote tables, where sorting a huge table on the server before streaming can hurt. For an in-process PythonTableFunction (from_df) source there is no network/serialization cost and chDB parallel sort beats pandas single-threaded sort_values, so unbounded ORDER BY can be pushed down profitably.

Plumb a `local_source` flag from core to the planner and short-circuit the no-LIMIT check when the source is a PythonTableFunction. Remote sources keep the cost-aware behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…alDtype

pandas sorts CategoricalDtype columns by the declared category order, but chDB has no concept of categorical and would sort by string/value literal order, producing a different result. This was already wrong on the existing sort_values().head(N) lazy SQL path; the previous commit widens the trigger to bare sort_values().

Add the dtype detection at the pandas_compat.sort_values entry point so all SQL-pushdown paths see the same fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
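The mismatch that motivates the categorical fallback can be illustrated in plain Python. This is an analogy, not pandas or chDB code: the declared category order is modeled as a list, and the "SQL" result is modeled as an ordinary string sort.

```python
# Declared category order, as it would appear in a CategoricalDtype.
categories = ["low", "medium", "high"]
values = ["high", "low", "medium", "low"]

# Sort by position in the declared category order (the pandas behavior described above).
rank = {c: i for i, c in enumerate(categories)}
by_category = sorted(values, key=rank.__getitem__)

# Sort by string value (what an engine without categoricals would do).
by_string = sorted(values)

print(by_category)  # ['low', 'low', 'medium', 'high']
print(by_string)    # ['high', 'low', 'low', 'medium']
```

Because the two orders disagree whenever the declared order is not lexical, pushing such a sort to SQL would change the result, which is why the fallback is checked before any pushdown path runs.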

wudidapaopao mentioned this pull request May 27, 2026

DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort #585

Closed

wudidapaopao changed the title ~~feat(datastore): push unbounded ORDER BY to SQL for local from_df source~~ Push unbounded ORDER BY to SQL for local from_df source May 27, 2026

wudidapaopao merged commit 66af328 into chdb-io:main May 27, 2026
6 checks passed

wudidapaopao merged 2 commits into chdb-io:main from wudidapaopao:feat/orderby-pushdown-local-source
