Closed
Description
This is the initial draft. I will complete it shortly.
The iterator state is attached to each IterDataPipe instance. This is super useful for:
- Determinism
- Snapshotting
- Benchmarking -> It becomes easier to register each DataPipe since each one has a different ID in the graph.
Implementation Options:
- Each DataPipe has a `_iterator` attribute as the placeholder for `__iter__` calls.
- Implement `__next__`. (My Preference)
  - It would make the instance picklable. Previously the generator function (`__iter__`) was not picklable. This helps multiprocessing and snapshotting.
  - `__iter__` returns `self` (`Forker(self)` may be another option, not 100% sure).
  - IMO, this is super useful as we can track the number of `__next__` calls to do a fast forward. The state of iteration is attached to the DataPipe instance, rather than to a temporary instance created from `__iter__`, whose internal state we couldn't track. (We can easily track states like RNG, iteration number, buffer, etc., as they are going to be attached to the `self` instance.)
  - Currently the source DataPipe is attached to each DataPipe, but the actual iteration happens at the iterator level, so the graph constructed by DataLoaderV2 doesn't match the actual execution graph.
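To make the `__next__` option concrete, here is a minimal sketch in plain Python. `CountingPipe`, its `index` attribute, and `fast_forward` are hypothetical names used only for illustration; this is not the torchdata API, just the general pattern of keeping iteration state on the instance.

```python
import pickle


class CountingPipe:
    """Hypothetical stand-in for an IterDataPipe (not the torchdata API)."""

    def __init__(self, source):
        self.source = list(source)  # stored as a list so the instance pickles
        self.index = 0              # number of __next__ calls served so far

    def __iter__(self):
        # Return self: the iteration state lives on the instance,
        # not on a temporary generator object.
        return self

    def __next__(self):
        if self.index >= len(self.source):
            raise StopIteration
        item = self.source[self.index]
        self.index += 1
        return item

    def fast_forward(self, n):
        # Skip ahead by replaying n __next__ calls.
        for _ in range(n):
            next(self)


def generator_pipe(source):
    # The generator-based style, for comparison.
    for item in source:
        yield item


# Consume two items, then snapshot the instance mid-iteration.
pipe = CountingPipe(range(5))
first_two = [next(pipe), next(pipe)]
restored = pickle.loads(pickle.dumps(pipe))
remaining = list(restored)

# A generator object cannot be pickled, so its state cannot be snapshotted.
try:
    pickle.dumps(generator_pipe(range(5)))
    generator_picklable = True
except (TypeError, pickle.PicklingError):
    generator_picklable = False

# Fast forward on a fresh instance using only the call count.
fresh = CountingPipe(range(5))
fresh.fast_forward(pipe.index)
after_fast_forward = list(fresh)
```

Because `index` is an ordinary attribute, a snapshot only needs to pickle the instance, and a restart only needs the number of `__next__` calls already made. The trade-off is that two consumers of the same instance share one cursor, which is why forking comes up below.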
DataLoader should raise an error if there are two DataPipe instances with the same ID in the graph. (Another option is for DataLoader to fork automatically.)
Users should use Forker for any DataPipe that they want to appear twice in the graph.
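A rough sketch of what the duplicate check could look like, assuming "same ID" means the same Python object reached through more than one edge. `Pipe`, `sources`, and `check_unique_instances` are hypothetical names for illustration, not part of DataLoader or torchdata.

```python
class Pipe:
    """Hypothetical DataPipe node that keeps references to its source pipes."""

    def __init__(self, name, *sources):
        self.name = name
        self.sources = sources


def check_unique_instances(sink):
    """Walk the graph from the sink; raise if an instance is reached twice."""
    seen = set()
    stack = [sink]
    while stack:
        pipe = stack.pop()
        if id(pipe) in seen:
            raise RuntimeError(
                f"DataPipe '{pipe.name}' appears more than once in the graph; "
                "fork it instead of reusing the instance"
            )
        seen.add(id(pipe))
        stack.extend(pipe.sources)
    return len(seen)


# One source instance feeding two branches: the check rejects this graph.
shared = Pipe("source")
bad_sink = Pipe("zip", Pipe("left", shared), Pipe("right", shared))
try:
    check_unique_instances(bad_sink)
    shared_rejected = False
except RuntimeError:
    shared_rejected = True

# Separate instances per branch (what forking would provide) pass the check.
good_sink = Pipe("zip", Pipe("left", Pipe("source_a")), Pipe("right", Pipe("source_b")))
node_count = check_unique_instances(good_sink)
```

The automatic-fork alternative would replace the `raise` with rewriting the graph, which is more convenient for users but hides the extra buffering a fork implies.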