[RFC] Disable the multiple Iterators per IterDataPipe (Make Iterator singleton) #45

@ejguan

This is the initial draft. I will complete it shortly.

The state of the Iterator would be attached to each IterDataPipe instance. This is super useful for:

  • Determinism
  • Snapshotting
  • Benchmarking: it becomes easier to register each DataPipe, since each one has a different ID in the graph.
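
For contrast, here is a minimal sketch of what happens when `__iter__` is a generator function: every `iter()` call creates a separate iterator with its own position, so the instance itself carries no iteration state. `GeneratorPipe` is a hypothetical stand-in written in plain Python, not the real IterDataPipe.

```python
class GeneratorPipe:
    """Hypothetical stand-in for a DataPipe whose __iter__ is a generator function."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # Each call builds a new generator object with its own private position.
        for item in self.source:
            yield item


pipe = GeneratorPipe([0, 1, 2, 3])
it_a = iter(pipe)
it_b = iter(pipe)

first_a = [next(it_a), next(it_a)]  # advances only it_a
first_b = next(it_b)                # it_b still starts from the beginning
```

Because `it_a` and `it_b` advance independently, nothing on `pipe` says how far iteration has progressed, which is what makes snapshotting and determinism harder to reason about.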

Implementation Options:

  • Each DataPipe has an _iterator attribute as the placeholder for __iter__ calls.
  • Implement __next__. (My preference)
    • It would make the instance picklable. Previously, the generator returned by a generator-function __iter__ was not picklable. This helps multiprocessing and snapshotting.
    • __iter__ returns self. (Forker(self) may be another option; not 100% sure.)
    • IMO, this is super useful because we can track the number of __next__ calls to do a fast-forward. The state of iteration is attached to the DataPipe instance rather than to a temporary instance created by __iter__, whose internal state we couldn't track. (We can easily track states like RNG, iteration number, buffer, etc., since they are attached to self.)
    • The source DataPipe is attached to each DataPipe, but the actual iteration happens at the Iterator level, so the graph constructed by DataLoaderV2 doesn't match the actual execution graph.
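
A minimal sketch of the `__next__` option, assuming an indexable, picklable source. `SingletonIterPipe`, `_num_yielded`, and `reset` are hypothetical names for illustration and are not part of any existing API.

```python
import pickle


class SingletonIterPipe:
    """Hypothetical DataPipe-like class that is its own iterator."""

    def __init__(self, source):
        self.source = list(source)  # assumption: indexable, picklable source
        self._num_yielded = 0       # number of successful __next__ calls

    def __iter__(self):
        # Return the instance itself, so every iter() call shares one state.
        return self

    def __next__(self):
        if self._num_yielded >= len(self.source):
            raise StopIteration
        item = self.source[self._num_yielded]
        self._num_yielded += 1
        return item

    def reset(self):
        # Hypothetical helper: start a new pass over the data.
        self._num_yielded = 0


pipe = SingletonIterPipe(range(5))
head = [next(pipe), next(pipe)]

# The iteration state lives on the instance, so it survives pickling.
restored = pickle.loads(pickle.dumps(pipe))
tail = list(restored)
```

In this sketch the restored copy resumes where the original stopped, and the `_num_yielded` counter is the kind of state a fast-forward could be built on. A real DataPipe would also need to decide when the state is reset between epochs.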

DataLoader triggers an error if there are two DataPipe instances with the same id in the graph. (Another option is for DataLoader to fork automatically.)
Users should use Forker for each DataPipe that they want to appear twice in the graph.
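
The duplicate check could look roughly like the sketch below. `Node`, `count_occurrences`, and `check_singleton_graph` are hypothetical helpers written in plain Python, and "same id" is interpreted here as the same instance being reachable through more than one path. This is not how DataLoader is implemented.

```python
from collections import Counter


class Node:
    """Hypothetical minimal DataPipe node that records its source DataPipes."""

    def __init__(self, name, *sources):
        self.name = name
        self.sources = sources


def count_occurrences(node, counts=None):
    # Walk every path from the final node back to the sources, counting
    # how many times each instance (keyed by id()) is reached.
    if counts is None:
        counts = Counter()
    counts[id(node)] += 1
    for src in node.sources:
        count_occurrences(src, counts)
    return counts


def check_singleton_graph(final_node):
    counts = count_occurrences(final_node)
    duplicated = [i for i, n in counts.items() if n > 1]
    if duplicated:
        raise RuntimeError(
            f"{len(duplicated)} DataPipe instance(s) appear more than once in the graph"
        )


source = Node("source")
diamond = Node("zip", Node("left", source), Node("right", source))
chain = Node("map", Node("source"))

check_singleton_graph(chain)  # passes: every instance appears once
try:
    check_singleton_graph(diamond)
    diamond_rejected = False
except RuntimeError:
    diamond_rejected = True
```

The diamond-shaped graph is rejected because `source` is consumed by two branches. Under the proposal, the user would wrap it with Forker so that each branch gets its own instance.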

cc: @VitalyFedyunin @NivekT
