[DataLoader2] Implementing single iterator constraint#700
[DataLoader2] Implementing single iterator constraint#700NivekT wants to merge 5 commits intogh/NivekT/82/basefrom
Conversation
[ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
ejguan
left a comment
There was a problem hiding this comment.
Let me know if those comments make sense to you.
torchdata/dataloader2/dataloader2.py
Outdated
| """ | ||
| Delegate methods (e.g. `limit`, `pause`, `resume`, etc) to the iterator | ||
| created by the ReadingService and DataPipe. | ||
| """ |
There was a problem hiding this comment.
If we directly delegate to datapipe_iter from dataloader_iter, we don't need this function.
There was a problem hiding this comment.
I am fine with this change, but I would flag that methods such as state_dict and shutdown will not be delegated to DataLoader2. I am going to change the implementation to do that. As long as that is not an issue, we can delegate directly to dataloader_iter.
There was a problem hiding this comment.
Also, do we want to allow users to do dl2.limit(), dl2.resume(), and etc? Or we want them to always invoke those from the iterator?
If we want the former, we will need to keep this method.
There was a problem hiding this comment.
IIRC, users should only invoke such APIs on iterator object. I am fine either.
cc: @Miiira
There was a problem hiding this comment.
I think invoking from one single place (i.e. iterator) rather than multiple places is better and less error-prone.
The internal test is calling .resume on the iterator, so I think it is fine to remove the API from the DataLoader2 class.
There was a problem hiding this comment.
Yes we only need limit, resume on iterator. I'm good with removing this from DataLoader2. Maybe we also want to look at Lightning DataModule train_dataloader return type with this change
Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Changes: * Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not * Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator` * Aside from the behavior related to reset (see below), we expect other behaviors to stay the same. Prior behavior: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # No reset here print(next(it2)) # 1 print(next(it1)) # 2 ``` New behavior with this PR: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator print(next(it2)) # 0 print(next(it1)) # Raises exception, since it1 is no longer valid ``` Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
NivekT
left a comment
There was a problem hiding this comment.
I updated this PR (one fix) and added a test to make through attributes/methods are being passed through. I will be looking at internal tests again shortly.
| class _ReadingServiceWrapper: | ||
| def __init__(self, dp): | ||
| self.dp = dp | ||
|
|
||
| def __iter__(self): | ||
| self.it = iter(self.dp) | ||
| return self | ||
|
|
||
| def __next__(self): | ||
| return next(self.it) | ||
|
|
||
| @staticmethod | ||
| def return_one(): | ||
| return 1 |
There was a problem hiding this comment.
A method that dataloader._datapipe_iter will have
Changes: * Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not * Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator` * Aside from the behavior related to reset (see below), we expect other behaviors to stay the same. Prior behavior: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # No reset here print(next(it2)) # 1 print(next(it1)) # 2 ``` New behavior with this PR: ```python dl = DataLoader2(IterableWrapper(range(10))) it1 = iter(dl) print(next(it1)) # 0 it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator print(next(it2)) # 0 print(next(it1)) # Raises exception, since it1 is no longer valid ``` Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Stack from ghstack:
Changes:
DataLoader2by having its__iter__method returns a wrapper that can track if an iterator is valid or not__iter__returnsself, but now it will return an instance of_DataLoader2IteratorPrior behavior:
New behavior with this PR:
Differential Revision: D38262867