Skip to content

DataStore: unbounded sort_values() on local from_df source falls back to pandas, missing chDB parallel sort #585

@wudidapaopao

Description

@wudidapaopao

Problem

QueryPlanner._can_push_op_to_sql refuses to push ORDER BY to SQL when no LIMIT follows. The intent (per the in-code comment) is to protect remote tables from full-table sorts. But the rule also fires for local from_df sources, where the data is already in the client process: there is no transport cost, and chDB's parallel ORDER BY is consistently faster than pandas single-threaded sort_values on large frames.

Result: the typical pattern ds[mask].sort_values([...]) produces a 3-segment hybrid plan (chDB filter → pandas sort), losing the chDB sort advantage.

Repro

import pandas as pd, numpy as np
from datastore import DataStore

rng = np.random.default_rng(7)
n = 10_000_000
pdf = pd.DataFrame({
    "date": ["2026-05-22"] * n,
    "code": np.array([f"{i:06d}" for i in rng.integers(1, 2001, size=n)]),
    "ix":   np.arange(n, dtype=np.int64),
    "flag": rng.integers(0, 2, size=n, dtype=np.int8),
})
ds = DataStore.from_df(pdf)
ds[ds.flag == 1].sort_values(["date", "code", "ix"]).to_pandas()

The planner logs [ORDER BY] Skipping push to SQL: no LIMIT follows, then runs the sort in pandas in segment 3.

Measured impact (chdb 4.1.8, pandas 3.0.3)

rows pandas native DataStore (today) chDB integrated SQL DataStore overhead
1M 91 ms 317 ms 303 ms small
5M 761 ms 1935 ms 1414 ms +37%
10M 1786 ms 4273 ms 2830 ms +51%
20M 4131 ms 9278 ms 5690 ms +63%

DataStore is consistently ~50% slower than the equivalent single-query chDB SQL because it skips the parallel ORDER BY.

Proposal

For an in-process PythonTableFunction source (the from_df case), push unbounded ORDER BY to SQL. Remote sources keep the current cost-aware behavior. connection.query_df already appends _row_id as a tie-breaker, so stable sort + pandas index restoration still work.

PR: #586

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions