[datasets] Support advanced windowing (e.g., by bytes) #18100

@richardliaw

Describe your feature request

For my use case, I'd like to read and shuffle one large window of my dataset at a time and feed each window into my multi-node training job.

Constraints:

  • The dataset is much larger than the available working memory and disk.

Ideally:

  • I can specify the amount of data (e.g., in bytes) to read into memory at a time, rather than a number of rows or blocks, since rows can have varying sizes.
  • I can pipeline the reading/shuffling/ingest stages so that GPUs don't sit idle waiting for data. Sliding windows might be nice here.

Currently, there's a way to do basic windowing (with some workarounds/stability issues), but the items listed under "Ideally" are not easy to implement on top of it.
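
To make the "window by bytes" request concrete, here is a minimal sketch in plain Python. It is not the Ray Datasets API: `windows_by_bytes`, `max_bytes`, and `size_of` are hypothetical names used only for illustration, and the sketch assumes rows arrive as an iterable and that a per-row size estimate is available. It shows only the grouping and per-window shuffle, not the pipelining across read/shuffle/ingest stages.

```python
import random
import sys
from typing import Callable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def windows_by_bytes(
    rows: Iterable[T],
    max_bytes: int,
    size_of: Callable[[T], int] = sys.getsizeof,
    shuffle_seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Group rows into windows whose estimated size stays within max_bytes.

    Hypothetical helper, not part of Ray. A single row larger than
    max_bytes is emitted as its own window instead of being dropped.
    """
    rng = random.Random(shuffle_seed) if shuffle_seed is not None else None
    window: List[T] = []
    window_bytes = 0

    for row in rows:
        row_bytes = size_of(row)
        # Close the current window before it would exceed the budget.
        if window and window_bytes + row_bytes > max_bytes:
            if rng is not None:
                rng.shuffle(window)  # shuffle within the window only
            yield window
            window, window_bytes = [], 0
        window.append(row)
        window_bytes += row_bytes

    # Emit whatever is left once the input is exhausted.
    if window:
        if rng is not None:
            rng.shuffle(window)
        yield window


if __name__ == "__main__":
    # Rows of varying size; len() stands in for a real size estimate.
    rows = [b"a" * 3, b"b" * 5, b"c" * 4, b"d" * 10, b"e" * 1]
    for w in windows_by_bytes(rows, max_bytes=8, size_of=len):
        print([len(r) for r in w])
```

Because the function is a generator, only one window is materialized at a time, which is the property needed when the data is much larger than memory. A real implementation would likely work on blocks rather than individual rows and would overlap loading of the next window with training on the current one.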

cc @clarkzinzow @wuisawesome @ericl @yuduber

Labels

  • P1: Issue that should be fixed within a few weeks
  • enhancement: Request for new feature and/or capability