Skip to content

Fix explain() mislabeling pushed-down GroupBy/apply as Pandas#576

Merged
auxten merged 1 commit into
mainfrom
fix/explain-label-pushed-groupby-as-chdb
May 21, 2026
Merged

Fix explain() mislabeling pushed-down GroupBy/apply as Pandas#576
auxten merged 1 commit into
mainfrom
fix/explain-label-pushed-groupby-as-chdb

Conversation

@auxten

@auxten auxten commented May 20, 2026

Copy link
Copy Markdown
Member

Summary

DataStore.explain() was labeling GroupBy / groupby().apply() ops as 🐼 [Pandas] even when they were clearly pushed down to chDB — the segment header said [chDB] and the generated SQL contained GROUP BY, but the per-op line lied.

Root cause

The base LazyOp.execution_engine() contract (lazy_ops.py:113-126) requires returning the canonical literals "chDB" or "Pandas". DataStore.explain() (core.py:801-816) only branches on "chDB"; anything else falls into the Pandas branch.

Two implementations returned the non-canonical literal "SQL":

  • LazyGroupByAgg.execution_engine() always returned "SQL".
  • LazyApply.execution_engine() returned "SQL" in its pushable branch.

So pushed-down GroupBy and apply ops silently rendered as Pandas. Display-only bug — execution path was unaffected; no consumer was matching on "SQL".

Reproduction (before fix)

ds = DataStore.from_file('amazon_sample.parquet')
res = (
    ds[ds['verified_purchase'] == True]
    .groupby('product_category')
    .agg({'star_rating': ['mean', 'count']})
)
res.explain()

Before:

    Segment 1 [chDB] (from source): Operations 2-5
 [2] 🚀 [chDB] WHERE: "verified_purchase" = TRUE
 [3] 🐼 [Pandas] GroupBy(['product_category']).{'star_rating': ['mean', 'count']}()   <-- wrong
 ...
SELECT "product_category", avg("star_rating") AS "mean", ...
  FROM file('amazon_sample.parquet', 'Parquet') WHERE ... GROUP BY ...                <-- chDB!

After:

    Segment 1 [chDB] (from source): Operations 2-3
 [2] 🚀 [chDB] WHERE: "verified_purchase" = TRUE
 [3] 🚀 [chDB] GroupBy(['product_category']).{'star_rating': ['mean', 'count']}()

Changes

  • datastore/lazy_ops.py: LazyGroupByAgg.execution_engine() and LazyApply.execution_engine() now return "chDB" / "Pandas" based on can_push_to_sql().
  • datastore/tests/test_groupby_apply_sql_pushdown.py: two existing assertions had baked in the buggy "SQL" literal — updated to "chDB".
  • datastore/tests/test_explain_method.py: new TestExplainEngineLabelingForPushedOps class with 5 regression tests covering the original user scenario (agg dict-of-list), simple .mean() agg, the non-pushable case (must stay Pandas), groupby().apply(sum) pushdown, and a direct unit test on the canonical-literal contract.

Test plan

  • pytest datastore/tests/test_explain_method.py — 22 passed
  • pytest datastore/tests/test_groupby_apply_sql_pushdown.py — 28 passed
  • pytest test_explain_segmented_execution.py test_comprehensive_segmented_execution.py test_sql_vs_pandas_filter.py test_lazy_chain_engine_verification.py test_lazy_execution.py test_case_when.py test_nullable_sql_pushdown.py — 203 passed, 1 xfailed (pre-existing, unrelated)
  • Manually verified the original user scenario now labels GroupBy as [chDB].

LazyGroupByAgg.execution_engine() returned the non-canonical literal
"SQL", and LazyApply.execution_engine() did the same in its pushable
branch. DataStore.explain() only recognizes the canonical "chDB" /
"Pandas" literals (as documented on the base LazyOp), so "SQL" silently
fell through to the Pandas branch. As a result the explain output
labeled GroupBy/apply ops as Pandas even when the segment header said
chDB and the generated SQL clearly contained GROUP BY pushed to chDB.

Both implementations now return "chDB" / "Pandas" based on
can_push_to_sql(). Updated two existing assertions that had baked in
the buggy "SQL" literal, and added a regression test class covering
the original user scenario (agg with dict-of-list), simple agg, the
non-pushable case, groupby().apply(sum), and a direct unit test on
LazyGroupByAgg.execution_engine().
@auxten auxten merged commit ff96c27 into main May 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant