Problem
QueryPlanner._can_push_op_to_sql refuses to push ORDER BY to SQL when no LIMIT follows. The intent (per the in-code comment) is to protect remote tables from full-table sorts. But the rule also fires for local from_df sources, where the data is already in the client process: there is no transport cost, and chDB's parallel ORDER BY is consistently faster than pandas single-threaded sort_values on large frames.
Result: the typical pattern ds[mask].sort_values([...]) produces a 3-segment hybrid plan (chDB filter → pandas sort), losing the chDB sort advantage.
Repro
import pandas as pd, numpy as np
from datastore import DataStore
rng = np.random.default_rng(7)
n = 10_000_000
pdf = pd.DataFrame({
"date": ["2026-05-22"] * n,
"code": np.array([f"{i:06d}" for i in rng.integers(1, 2001, size=n)]),
"ix": np.arange(n, dtype=np.int64),
"flag": rng.integers(0, 2, size=n, dtype=np.int8),
})
ds = DataStore.from_df(pdf)
ds[ds.flag == 1].sort_values(["date", "code", "ix"]).to_pandas()
The planner logs [ORDER BY] Skipping push to SQL: no LIMIT follows, then runs the sort in pandas in segment 3.
Measured impact (chdb 4.1.8, pandas 3.0.3)
| rows |
pandas native |
DataStore (today) |
chDB integrated SQL |
DataStore overhead |
| 1M |
91 ms |
317 ms |
303 ms |
small |
| 5M |
761 ms |
1935 ms |
1414 ms |
+37% |
| 10M |
1786 ms |
4273 ms |
2830 ms |
+51% |
| 20M |
4131 ms |
9278 ms |
5690 ms |
+63% |
DataStore is consistently ~50% slower than the equivalent single-query chDB SQL because it skips the parallel ORDER BY.
Proposal
For an in-process PythonTableFunction source (the from_df case), push unbounded ORDER BY to SQL. Remote sources keep the current cost-aware behavior. connection.query_df already appends _row_id as a tie-breaker, so stable sort + pandas index restoration still work.
PR: #586
Problem
QueryPlanner._can_push_op_to_sqlrefuses to push ORDER BY to SQL when no LIMIT follows. The intent (per the in-code comment) is to protect remote tables from full-table sorts. But the rule also fires for localfrom_dfsources, where the data is already in the client process: there is no transport cost, and chDB's parallel ORDER BY is consistently faster than pandas single-threadedsort_valueson large frames.Result: the typical pattern
ds[mask].sort_values([...])produces a 3-segment hybrid plan (chDB filter → pandas sort), losing the chDB sort advantage.Repro
The planner logs
[ORDER BY] Skipping push to SQL: no LIMIT follows, then runs the sort in pandas in segment 3.Measured impact (chdb 4.1.8, pandas 3.0.3)
DataStore is consistently ~50% slower than the equivalent single-query chDB SQL because it skips the parallel ORDER BY.
Proposal
For an in-process
PythonTableFunctionsource (thefrom_dfcase), push unbounded ORDER BY to SQL. Remote sources keep the current cost-aware behavior.connection.query_dfalready appends_row_idas a tie-breaker, so stable sort + pandas index restoration still work.PR: #586