feat(datastore): mirror pandas iteration on LazyGroupBy / SeriesGroupBy by wudidapaopao · Pull Request #582 · chdb-io/chdb

wudidapaopao · 2026-05-25T10:09:49Z

Closes #581.

Summary

LazyGroupBy had no __iter__, so for k, g in ds.groupby([...]): fell back to Python's legacy sequence protocol, called __getitem__(0), and raised TypeError: Expected str or list, got int. The message was misleading because the real problem was a missing feature. ColumnExpr.__iter__ also dropped the groupby context: for x in gb['col']: yielded raw values instead of the (key, sub_series) pairs that pandas SeriesGroupBy yields.
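The fallback described above is standard Python behavior: when a class defines __getitem__ but not __iter__, a for loop calls __getitem__ with 0, 1, 2, and so on. The sketch below reproduces the misleading error with two hypothetical stand-in classes (NoIter and WithIter are illustrations, not chdb code).

```python
class NoIter:
    """Hypothetical stand-in for a groupby object that only supports column selection."""

    def __getitem__(self, key):
        # Column selection accepts a name or a list of names only.
        if not isinstance(key, (str, list)):
            raise TypeError(f"Expected str or list, got {type(key).__name__}")
        return f"column selection: {key}"


class WithIter(NoIter):
    """Same stand-in, with an explicit __iter__ that yields (key, group) pairs."""

    def __init__(self, groups):
        self._groups = groups

    def __iter__(self):
        return iter(self._groups.items())


def first_error(obj):
    """Iterate obj and return the TypeError message, or None if iteration works."""
    try:
        for _ in obj:
            pass
    except TypeError as exc:
        return str(exc)
    return None


# Without __iter__, the for loop calls __getitem__(0) and surfaces its error.
print(first_error(NoIter()))
# With __iter__, iteration yields (key, group) pairs.
print(list(WithIter({"a": [1, 3], "b": [2]})))
```

Defining __iter__ takes precedence over the __getitem__ fallback, which is why adding it removes the confusing error path entirely.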

Changes

datastore/groupby.py — LazyGroupBy adds __iter__, __len__, __contains__, get_group(name, obj=None), groups, indices. All delegate to a shared _pandas_groupby() helper that respects sort / dropna / selected_columns; single-col groupby uses scalar by to match the gb['col'] scalar-key convention. ngroups refactored onto the helper too.
datastore/column_expr.py — __iter__ yields (key, sub_series) when _groupby_fields is set; plain ds['col'] iteration unchanged.
__getitem__ errors — clearer TypeError for gb[<int>] (suggests iter / get_group) and gb[<other>] (suggests str / list of str).
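A minimal sketch of the delegation pattern described above, assuming the wrapper holds an in-memory pandas DataFrame. MiniGroupBy and its constructor are hypothetical; only the helper name _pandas_groupby and the scalar-by rule come from this description, and the real LazyGroupBy may differ.

```python
import pandas as pd


class MiniGroupBy:
    """Hypothetical stand-in for LazyGroupBy; not the chdb implementation."""

    def __init__(self, df, by, sort=True, dropna=True):
        self._df = df
        self._by = [by] if isinstance(by, str) else list(by)
        self._sort = sort
        self._dropna = dropna

    def _pandas_groupby(self):
        # A single grouping column is passed as a scalar so that keys are
        # scalars rather than 1-tuples.
        by = self._by[0] if len(self._by) == 1 else self._by
        return self._df.groupby(by, sort=self._sort, dropna=self._dropna)

    def __iter__(self):
        return iter(self._pandas_groupby())

    def __len__(self):
        return self._pandas_groupby().ngroups

    def __contains__(self, key):
        return key in self._pandas_groupby().groups

    def get_group(self, name):
        return self._pandas_groupby().get_group(name)


df = pd.DataFrame({"cat": ["a", "b", "a"], "v": [1, 2, 3]})
single = MiniGroupBy(df, "cat")
multi = MiniGroupBy(df, ["cat", "v"])

print([key for key, _ in single])        # scalar keys
print([key for key, _ in multi])         # tuple keys
print(len(single), "a" in single)
print(single.get_group("a")["v"].tolist())
```

Routing every method through one helper keeps sort and dropna handling in a single place, so iteration, len, membership, and get_group cannot disagree about which groups exist.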

Tests

datastore/tests/test_groupby_iteration.py — 44 mirror-pattern tests across 10 TestCases:

single/multi-column iter with scalar / tuple keys, index preservation
sort=True/False, dropna=True/False, as_index=False, column selection
get_group single / multi / column-selected / missing-key / legacy obj= kwarg
.groups / .indices for single & multi column
len(gb) / key in gb
datetime keys, NaN keys, empty / single-row / all-same-value inputs
SeriesGroupBy iter + clear error for iter on computed expressions
gb[<int|float|None>] error path coverage

All 44 pass locally. Verified no regression on adjacent column_expr / groupby test files (178 passed).

Verification

for (date, code), group in ds.groupby(['date', 'code']):
    print(date, code, len(group))   # works, mirrors pandas

for k, s in ds.groupby('cat')['v']:
    print(k, s.tolist())            # SeriesGroupBy semantics
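
For comparison, the pandas behavior that these loops are meant to mirror can be checked directly. The sketch below uses plain pandas with made-up sample data; it does not exercise DataStore itself.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "date": ["2026-01-01", "2026-01-01", "2026-01-02"],
    "code": ["A", "B", "A"],
    "cat": ["x", "y", "x"],
    "v": [1, 2, 3],
})

# Multi-column groupby: each key is a tuple, each group a sub-DataFrame.
multi_keys = [key for key, _group in df.groupby(["date", "code"])]

# SeriesGroupBy: each item is (key, sub_series); the original index is preserved.
series_pairs = {key: sub.tolist() for key, sub in df.groupby("cat")["v"]}
series_index = {key: sub.index.tolist() for key, sub in df.groupby("cat")["v"]}

print(multi_keys)
print(series_pairs)
print(series_index)
```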

… only transform() and similar groupby-context methods re-copy _expr=Field and _groupby_fields onto an op-mode result. The previous check, _groupby_fields and isinstance(_expr, Field), misclassified such results as SeriesGroupBy and yielded (key, sub_series) pairs instead of transform values. The check now also requires _source, _op_type, and _agg_func_name to all be None, so only the pure ds.groupby(...)[col] ColumnExpr triggers SeriesGroupBy iteration. A regression test in test_groupby_iteration.py covers groupby + transform.
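
The tightened condition can be sketched as a small predicate. The attribute names below are taken from the commit message; the Field and ColumnExprSketch classes and the "transform" value of _op_type are hypothetical stand-ins, not the real chdb types.

```python
from dataclasses import dataclass
from typing import Any, Optional


class Field:
    """Hypothetical stand-in for a plain column reference."""

    def __init__(self, name):
        self.name = name


@dataclass
class ColumnExprSketch:
    """Hypothetical container carrying only the attributes named in the commit."""

    _expr: Any = None
    _groupby_fields: Optional[list] = None
    _source: Any = None
    _op_type: Optional[str] = None
    _agg_func_name: Optional[str] = None


def is_series_groupby(col):
    """True only for a pure groupby(...)[col] selection with no operation attached."""
    return (
        bool(col._groupby_fields)
        and isinstance(col._expr, Field)
        and col._source is None
        and col._op_type is None
        and col._agg_func_name is None
    )


pure = ColumnExprSketch(_expr=Field("v"), _groupby_fields=["cat"])
# An op-mode result that re-copied _expr and _groupby_fields ("transform" is assumed).
transformed = ColumnExprSketch(_expr=Field("v"), _groupby_fields=["cat"], _op_type="transform")

print(is_series_groupby(pure), is_series_groupby(transformed))
```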

…s3 ok)

The 4 dropna+NaN tests were dropped in earlier commits to unblock CI on pandas 2.x, which has version-specific bugs:
- get_group(np.nan) / get_group((x, np.nan)) raise KeyError (NaN != NaN in hash lookup); fixed in pandas 3.x
- groupby(NaN-col, dropna=False).groups raises ValueError 'Categorical categories cannot be null'; fixed in pandas 3.x

DataStore mirrors pandas via _pandas_groupby(), so it inherits whatever pandas does. Skip-on-pandas2 keeps pandas 3.x coverage instead of losing the tests entirely.
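
The NaN != NaN lookup problem can be illustrated with a plain Python dict. This is an analogy only and makes no claim about pandas internals; the groups dict and the nan_aware_lookup helper are hypothetical.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a different NaN object

# Hypothetical group table keyed by a NaN value.
groups = {nan_a: [0, 2]}

# The same object is found through the identity shortcut in dict lookup.
same_object_found = nan_a in groups
# A different NaN object is not found, because NaN never compares equal to NaN.
other_nan_found = nan_b in groups


def nan_aware_lookup(table, key):
    """Hypothetical helper: treat any NaN key as matching any NaN entry."""
    if isinstance(key, float) and math.isnan(key):
        for existing, value in table.items():
            if isinstance(existing, float) and math.isnan(existing):
                return value
        raise KeyError(key)
    return table[key]


print(same_object_found, other_nan_found)
print(nan_aware_lookup(groups, nan_b))
```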

wudidapaopao force-pushed the support_groupby_iter branch from 5654ef1 to 5ae7dee Compare May 25, 2026 10:11

feat(datastore): support pandas iteration on LazyGroupBy / SeriesGroupBy

d90743f

wudidapaopao force-pushed the support_groupby_iter branch from 5ae7dee to d90743f Compare May 25, 2026 10:29

wudidapaopao added 6 commits May 25, 2026 18:45

test(datastore): drop pandas2/3 incompatible nan-tuple get_group test

6634b47

test(datastore): drop nan-key get_group test (pandas2 KeyError, panda…

6b924b2

…s3 ok)

test(datastore): drop nan+dropna_false .groups test (pandas2 bug)

a4c4db8

test(datastore): drop redundant SeriesGroupBy dropna iter test

f89db1b

wudidapaopao merged commit ae0b20d into chdb-io:main May 25, 2026
6 checks passed

wudidapaopao merged 7 commits into chdb-io:main from wudidapaopao:support_groupby_iter
