feat(datastore): mirror pandas iteration on LazyGroupBy / SeriesGroupBy #582

Merged
wudidapaopao merged 7 commits into chdb-io:main from wudidapaopao:support_groupby_iter on May 25, 2026

@wudidapaopao wudidapaopao commented May 25, 2026

Closes #581.

Summary

LazyGroupBy had no __iter__, so for k, g in ds.groupby([...]): fell back to Python's legacy sequence protocol, called __getitem__(0), and raised TypeError: Expected str or list, got int. The message was misleading because it hid a missing feature. ColumnExpr.__iter__ also dropped the groupby context: for x in gb['col']: yielded raw values instead of the (key, sub_series) pairs that pandas SeriesGroupBy yields.
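
The fallback described above is standard Python behaviour: when a class defines __getitem__ but not __iter__, a for loop calls __getitem__ with 0, 1, 2, and so on. A minimal sketch, where OnlyGetItem is a hypothetical stand-in and not the real LazyGroupBy:

```python
class OnlyGetItem:
    """Hypothetical stand-in: defines __getitem__ but no __iter__."""

    def __getitem__(self, key):
        # Mimics a column selector that only accepts names.
        if not isinstance(key, (str, list)):
            raise TypeError(f"Expected str or list, got {type(key).__name__}")
        return key


def first_iteration_error(obj):
    """Return the TypeError message raised when iteration starts, else None."""
    try:
        for _ in obj:
            break
    except TypeError as exc:
        return str(exc)
    return None


# The for loop silently calls obj[0], so the error mentions int, not iteration.
message = first_iteration_error(OnlyGetItem())
print(message)
```

The error therefore talks about key types rather than about iteration, which is why the original message was hard to diagnose.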

Changes

  • datastore/groupby.py: LazyGroupBy adds __iter__, __len__, __contains__, get_group(name, obj=None), groups, and indices. All delegate to a shared _pandas_groupby() helper that respects sort / dropna / selected_columns; a single-column groupby passes a scalar by so that keys are scalars, matching the gb['col'] scalar-key convention. ngroups is refactored onto the helper too.
  • datastore/column_expr.py: __iter__ yields (key, sub_series) when _groupby_fields is set; plain ds['col'] iteration is unchanged.
  • __getitem__ errors: clearer TypeError for gb[<int>] (suggests iter / get_group) and gb[<other>] (suggests str / list of str).
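
The delegation pattern in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the DataStore code: GroupByFacade, its constructor arguments, and the error wording are hypothetical, and only the _pandas_groupby name comes from the PR.

```python
import pandas as pd


class GroupByFacade:
    """Hypothetical stand-in for LazyGroupBy; not DataStore's actual code."""

    def __init__(self, frame, by, sort=True, dropna=True, selected_columns=None):
        self._frame = frame
        self._by = [by] if isinstance(by, str) else list(by)
        self._sort = sort
        self._dropna = dropna
        self._selected_columns = selected_columns

    def _pandas_groupby(self):
        # One grouping column is passed as a scalar so keys come back as
        # scalars rather than 1-tuples.
        by = self._by[0] if len(self._by) == 1 else self._by
        grouped = self._frame.groupby(by, sort=self._sort, dropna=self._dropna)
        if self._selected_columns is not None:
            grouped = grouped[self._selected_columns]
        return grouped

    def __iter__(self):
        return iter(self._pandas_groupby())

    def __len__(self):
        return self._pandas_groupby().ngroups

    def __contains__(self, key):
        return key in self._pandas_groupby().groups

    def get_group(self, name):
        return self._pandas_groupby().get_group(name)

    def __getitem__(self, key):
        # Assumed wording; the PR only says the messages suggest alternatives.
        if isinstance(key, int):
            raise TypeError("integer keys are not supported; iterate or call get_group(name)")
        if not isinstance(key, (str, list)):
            raise TypeError(f"expected str or list of str, got {type(key).__name__}")
        return GroupByFacade(self._frame, self._by, self._sort, self._dropna, key)


df = pd.DataFrame({"cat": ["b", "a", "b"], "v": [1, 2, 3]})
gb = GroupByFacade(df, "cat")
keys = [key for key, _ in gb]  # scalar keys, sorted because sort=True
```

Routing every method through one helper means sort, dropna, and column selection are applied in a single place, so iteration, len, and get_group cannot drift apart.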

Tests

datastore/tests/test_groupby_iteration.py: 44 mirror-pattern tests across 10 TestCases:

  • single/multi-column iter with scalar / tuple keys, index preservation
  • sort=True/False, dropna=True/False, as_index=False, column selection
  • get_group single / multi / column-selected / missing-key / legacy obj= kwarg
  • .groups / .indices for single & multi column
  • len(gb) / key in gb
  • datetime keys, NaN keys, empty / single-row / all-same-value inputs
  • SeriesGroupBy iter + clear error for iter on computed expressions
  • gb[<int|float|None>] error path coverage

All 44 pass locally. Verified no regression on adjacent column_expr / groupby test files (178 passed).

Verification

for (date, code), group in ds.groupby(['date', 'code']):
    print(date, code, len(group))   # works, mirrors pandas

for k, s in ds.groupby('cat')['v']:
    print(k, s.tolist())            # SeriesGroupBy semantics
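
For comparison, here is the plain pandas behaviour that the two loops above are meant to mirror. The sample frame is made up for illustration; only the column names follow the snippet above.

```python
import pandas as pd

# Made-up sample data; column names follow the verification snippet.
df = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2"],
        "code": ["x", "x", "y"],
        "cat": ["a", "b", "a"],
        "v": [1, 2, 3],
    }
)

# Multi-column groupby: each key is a tuple, each group a sub-DataFrame.
frame_pairs = [(key, len(group)) for key, group in df.groupby(["date", "code"])]

# Column selected after groupby: each key is a scalar, each group a sub-Series.
series_pairs = [(key, sub.tolist()) for key, sub in df.groupby("cat")["v"]]

print(frame_pairs)
print(series_pairs)
```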

@wudidapaopao wudidapaopao force-pushed the support_groupby_iter branch from 5654ef1 to 5ae7dee Compare May 25, 2026 10:11
@wudidapaopao wudidapaopao force-pushed the support_groupby_iter branch from 5ae7dee to d90743f Compare May 25, 2026 10:29
… only

transform() and similar groupby-context methods copy _expr=Field and
_groupby_fields onto their op-mode result. The previous check
(_groupby_fields and isinstance(_expr, Field)) therefore misclassified
such results as SeriesGroupBy and yielded (key, sub_series) pairs
instead of transform values.

Also require _source, _op_type, and _agg_func_name to all be None, so
that only the pure ds.groupby(...)[col] ColumnExpr triggers SeriesGroupBy
iteration.

Regression test in test_groupby_iteration.py covers groupby + transform.
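
The tightened check from this commit can be sketched as below. The attribute names come from the commit message; the Field and ColumnExprState classes and the is_series_groupby function are hypothetical stand-ins, since the real ColumnExpr internals are not shown here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


class Field:
    """Hypothetical stand-in for DataStore's Field expression."""


@dataclass
class ColumnExprState:
    """Hypothetical container holding only the attributes the check reads."""

    _expr: Any = None
    _groupby_fields: List[str] = field(default_factory=list)
    _source: Optional[Any] = None
    _op_type: Optional[str] = None
    _agg_func_name: Optional[str] = None


def is_series_groupby(col: ColumnExprState) -> bool:
    """True only for a pure groupby(...)[col] selection, not an op-mode result."""
    return (
        bool(col._groupby_fields)
        and isinstance(col._expr, Field)
        and col._source is None
        and col._op_type is None
        and col._agg_func_name is None
    )


pure = ColumnExprState(_expr=Field(), _groupby_fields=["cat"])
# A transform-like result carries the same _expr and _groupby_fields,
# but is marked as an operation; "transform" is an assumed value.
transformed = ColumnExprState(_expr=Field(), _groupby_fields=["cat"], _op_type="transform")

print(is_series_groupby(pure), is_series_groupby(transformed))
```
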
The 4 dropna+NaN tests were dropped in earlier commits to unblock CI on
pandas 2.x, which has version-specific bugs:
- get_group(np.nan) / get_group((x, np.nan)) raise KeyError (NaN!=NaN
  in hash lookup), fixed in pandas 3.x
- groupby(NaN-col, dropna=False).groups raises ValueError 'Categorical
  categories cannot be null', fixed in pandas 3.x

DataStore mirrors pandas via _pandas_groupby(), so it inherits whatever
pandas does. Skip-on-pandas2 keeps pandas 3.x coverage instead of losing
the tests entirely.
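
A skip-on-pandas-2 guard of the kind described above might look like the sketch below. It assumes unittest-style tests (the PR mentions TestCases); the class name, test name, and skip reason are hypothetical, and the test body encodes the pandas 3.x behaviour reported in the commit message rather than something verified here.

```python
import unittest

import numpy as np
import pandas as pd

# Major version parsed from the installed pandas, e.g. "2.2.3" -> 2.
PANDAS_MAJOR = int(pd.__version__.split(".")[0])


class TestNaNKeys(unittest.TestCase):
    """Hypothetical test case; names are illustrative only."""

    @unittest.skipIf(PANDAS_MAJOR < 3, "get_group(NaN) raises KeyError on pandas 2.x")
    def test_get_group_nan(self):
        # Behaviour as reported in the commit message for pandas 3.x;
        # not verified in this sketch.
        df = pd.DataFrame({"k": ["a", np.nan, "a"], "v": [1, 2, 3]})
        group = df.groupby("k", dropna=False).get_group(np.nan)
        self.assertEqual(group["v"].tolist(), [2])
```

Skipping by version keeps the test in the suite, so it starts running again as soon as CI moves to a pandas release where the lookup works.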
@wudidapaopao wudidapaopao merged commit ae0b20d into chdb-io:main May 25, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

DataStore: LazyGroupBy is not iterable; gb['col'] iteration drops grouping