-
Notifications
You must be signed in to change notification settings - Fork 7.4k
[data] returning iterators from map_groups #57935
Copy link
Copy link
Closed
Labels
dataRay Data-related issuesRay Data-related issuesenhancementRequest for new feature and/or capabilityRequest for new feature and/or capabilitygood-first-issueGreat starter issue for someone just starting to contribute to RayGreat starter issue for someone just starting to contribute to RaytriageNeeds triage (eg: priority, bug/not-bug, and owning component)Needs triage (eg: priority, bug/not-bug, and owning component)
Description
Description
Currently, you cannot return a generator object from map_groups UDF, which means if you have a large output of your UDF, you can overwhelm heap memory and object store (even if the output is shardable).
Ideally map_groups should be similar to map_batches, where you can return a Batch (dict numpy, pa.table, pandas df) or Iterator[Batch].
Use case
Right now in one of my pipelines I am creating a large O(n^2) output, which could really benefit from having an iterable output, instead of needing to dump everything in bulk.
def create_iterator(batch):
for i in range(10):
yield batch
ds.groupby(keys).map_groups(create_iterator)Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
dataRay Data-related issuesRay Data-related issuesenhancementRequest for new feature and/or capabilityRequest for new feature and/or capabilitygood-first-issueGreat starter issue for someone just starting to contribute to RayGreat starter issue for someone just starting to contribute to RaytriageNeeds triage (eg: priority, bug/not-bug, and owning component)Needs triage (eg: priority, bug/not-bug, and owning component)