Skip to content

[data] returning iterators from map_groups #57935

@richardliaw

Description

@richardliaw

Description

Currently, you cannot return a generator object from map_groups UDF, which means if you have a large output of your UDF, you can overwhelm heap memory and object store (even if the output is shardable).

Ideally map_groups should be similar to map_batches, where you can return a Batch (dict numpy, pa.table, pandas df) or Iterator[Batch].

Use case

Right now in one of my pipelines I am creating a large O(n^2) output, which could really benefit from having an iterable output, instead of needing to dump everything in bulk.

def create_iterator(batch):
    for i in range(10):
        yield batch
ds.groupby(keys).map_groups(create_iterator)

Metadata

Metadata

Labels

dataRay Data-related issuesenhancementRequest for new feature and/or capabilitygood-first-issueGreat starter issue for someone just starting to contribute to RaytriageNeeds triage (eg: priority, bug/not-bug, and owning component)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions