DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort

## Problem

`QueryPlanner._can_push_op_to_sql` refuses to push ORDER BY to SQL when no LIMIT follows. The intent (per the in-code comment) is to protect remote tables from full-table sorts. But the rule also fires for **local `from_df` sources**, where the data is already in the client process: there is no transport cost, and chDB's parallel ORDER BY is consistently faster than pandas single-threaded `sort_values` on large frames.

Result: the typical pattern `ds[mask].sort_values([...])` produces a 3-segment hybrid plan (chDB filter → pandas sort), losing the chDB sort advantage.

## Repro

```python
import pandas as pd, numpy as np
from datastore import DataStore

rng = np.random.default_rng(7)
n = 10_000_000
pdf = pd.DataFrame({
    "date": ["2026-05-22"] * n,
    "code": np.array([f"{i:06d}" for i in rng.integers(1, 2001, size=n)]),
    "ix":   np.arange(n, dtype=np.int64),
    "flag": rng.integers(0, 2, size=n, dtype=np.int8),
})
ds = DataStore.from_df(pdf)
ds[ds.flag == 1].sort_values(["date", "code", "ix"]).to_pandas()
```

The planner logs `[ORDER BY] Skipping push to SQL: no LIMIT follows`, then runs the sort in pandas in segment 3.

## Measured impact (chdb 4.1.8, pandas 3.0.3)

| rows | pandas native | DataStore (today) | chDB integrated SQL | DataStore overhead |
|---|---|---|---|---|
| 1M | 91 ms | 317 ms | 303 ms | small |
| 5M | 761 ms | 1935 ms | 1414 ms | +37% |
| 10M | 1786 ms | 4273 ms | 2830 ms | +51% |
| 20M | 4131 ms | 9278 ms | 5690 ms | +63% |

DataStore is consistently ~50% slower than the equivalent single-query chDB SQL because it skips the parallel ORDER BY.

## Proposal

For an in-process `PythonTableFunction` source (the `from_df` case), push unbounded ORDER BY to SQL. Remote sources keep the current cost-aware behavior. `connection.query_df` already appends `_row_id` as a tie-breaker, so stable sort + pandas index restoration still work.

PR: #586

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort #585

Problem

Repro

Measured impact (chdb 4.1.8, pandas 3.0.3)

Proposal

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

rows	pandas native	DataStore (today)	chDB integrated SQL	DataStore overhead
1M	91 ms	317 ms	303 ms	small
5M	761 ms	1935 ms	1414 ms	+37%
10M	1786 ms	4273 ms	2830 ms	+51%
20M	4131 ms	9278 ms	5690 ms	+63%

Uh oh!

DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort #585

Description

Problem

Repro

Measured impact (chdb 4.1.8, pandas 3.0.3)

Proposal

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions