Push unbounded ORDER BY to SQL for local from_df source #586

Merged
wudidapaopao merged 2 commits into chdb-io:main from wudidapaopao:feat/orderby-pushdown-local-source
May 27, 2026
@wudidapaopao

Fixes #585.

What

In QueryPlanner._can_push_op_to_sql, the unbounded-sort guard now only applies to non-local sources. For an in-process PythonTableFunction (from DataStore.from_df), unbounded ORDER BY is pushed to SQL.

Threading: core.py computes local_source = (self._source_df is not None or isinstance(self._table_function, PythonTableFunction)) and passes it through plan_segments to _can_push_op_to_sql.

Remote table sources, dict-source DataStore({...}), and the existing GROUP-BY-strip / first-last / LIMIT branches are all untouched.
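
A minimal sketch of the guard described above. The function name `can_push_order_by` and its two boolean parameters are hypothetical and only illustrate the decision; the real check lives in `QueryPlanner._can_push_op_to_sql` and handles more cases (GROUP BY strip, first/last, and so on).

```python
def can_push_order_by(has_limit: bool, local_source: bool) -> bool:
    """Decide whether an ORDER BY may be pushed into the SQL segment.

    Hypothetical helper, not the planner's actual signature.
    """
    # Local in-process source: no transport cost, so an unbounded
    # sort is worth pushing down.
    if local_source:
        return True
    # Non-local source: keep the cost-aware rule and push only
    # when a LIMIT bounds the sort.
    return has_limit


# Unbounded sort on a local from_df source is now pushed.
print(can_push_order_by(has_limit=False, local_source=True))
# Unbounded sort on a remote source still falls back.
print(can_push_order_by(has_limit=False, local_source=False))
```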

Why

For local sources there is no transport/serialization cost and chDB parallel ORDER BY beats pandas single-threaded sort_values on large frames. The existing 3-segment hybrid plan (chDB filter → pandas sort) leaves performance on the table — see the table in #585.

After this change, ds[mask].sort_values([...]).to_pandas() produces a single SQL segment and matches chdb.session.query(sql, "DataFrame") performance.

Correctness

  • connection.query_df already appends _row_id as a tie-breaker (add_row_id_as_tiebreaker), so the push is stable-sort equivalent to pandas' default.
  • pandas index restoration via _row_id is unchanged.
  • na_position != 'last' and key= are already caught at the sort_values entry point in pandas_compat.py and fall back to pandas before the planner sees them.
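
The tie-breaker argument can be checked with pandas alone. This sketch assumes a `_row_id` column holding the original row position (the column name comes from the description above; how chDB attaches it is not shown here). Sorting on the key plus `_row_id` gives the same order as a stable sort on the key, whatever order the rows arrive in.

```python
import pandas as pd

# Small frame with duplicate keys so that tie order matters.
df = pd.DataFrame({"k": [2, 1, 2, 1, 2], "v": list("abcde")})
df["_row_id"] = range(len(df))  # original row position

# Simulate an engine that hands rows back in arbitrary order.
shuffled = df.sample(frac=1.0, random_state=0)

# Sorting on (key, _row_id) fixes the order of ties deterministically.
by_tiebreak = shuffled.sort_values(["k", "_row_id"])

# Reference: a stable sort on the key alone, from the original order.
stable = df.sort_values("k", kind="stable")

same_order = by_tiebreak["v"].tolist() == stable["v"].tolist()
print(same_order)
```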

Tests

New class TestLocalSourceUnboundedOrderByPushed in test_orderby_cost_awareness.py:

  • ORDER BY is present in the SQL segment for from_df source.
  • Result matches pandas for: ascending/descending, multi-col, mixed ascending=[True,False], filter+sort chain.
  • Pre-GROUP BY sort is still stripped (existing behavior, sanity check).
  • Dict-source DataStore({...}) still does not push (sanity check that the change is scoped).

Full local test run: 11006 passed, 16 skipped, 84 xfailed, 5 xpassed.

Measured impact (chdb 4.1.8 / pandas 3.0.3, 5M rows, repeated runs)

| | before | after |
|---|---|---|
| DataStore filter+sort | ~1930 ms | ~1530 ms (matches direct chDB SQL) |

Larger gains at 10M / 20M rows (see #585 table).

The cost-aware ORDER BY pushdown rule was written for remote tables,
where sorting a huge table on the server before streaming can hurt.
For an in-process PythonTableFunction (from_df) source there is no
network/serialization cost and chDB parallel sort beats pandas
single-threaded sort_values, so unbounded ORDER BY can be pushed
down profitably.

Plumb a `local_source` flag from core to the planner and short-circuit
the no-LIMIT check when the source is a PythonTableFunction.

Remote sources keep the cost-aware behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wudidapaopao wudidapaopao changed the title feat(datastore): push unbounded ORDER BY to SQL for local from_df source Push unbounded ORDER BY to SQL for local from_df source May 27, 2026
…alDtype

pandas sorts CategoricalDtype columns by the declared category order, but
chDB has no concept of categorical and would sort by string/value literal
order, producing a different result.

This was already wrong on the existing sort_values().head(N) lazy SQL
path; the previous commit widens the trigger to bare sort_values(). Add
the dtype detection at the pandas_compat.sort_values entry point so all
SQL-pushdown paths see the same fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
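
To illustrate the CategoricalDtype fallback from the second commit: the sketch below shows why the two orders differ and one way such a dtype check could look. The helper `has_categorical_sort_key` is a hypothetical name, not the function in pandas_compat.py.

```python
import pandas as pd


def has_categorical_sort_key(df: pd.DataFrame, by) -> bool:
    """Return True if any sort key column is categorical (hypothetical helper)."""
    keys = [by] if isinstance(by, str) else list(by)
    return any(isinstance(df[k].dtype, pd.CategoricalDtype) for k in keys)


sizes = pd.Categorical(
    ["medium", "small", "large"],
    categories=["small", "medium", "large"],
    ordered=True,
)
df = pd.DataFrame({"size": sizes, "n": [1, 2, 3]})

# pandas follows the declared category order.
pandas_order = df.sort_values("size")["size"].astype(str).tolist()

# A plain string sort, which is what a value-literal ordering would give.
string_order = sorted(df["size"].astype(str))

print(pandas_order, string_order, has_categorical_sort_key(df, "size"))
```

When the check returns True, a caller would skip the SQL pushdown and let pandas do the sort.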
@wudidapaopao wudidapaopao merged commit 66af328 into chdb-io:main May 27, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort
