[DataLoader] Add __len__ to DataLoader2#728
[DataLoader] Add __len__ to DataLoader2#728NivekT wants to merge 9 commits intogh/NivekT/87/basefrom
Conversation
[ghstack-poisoned]
NivekT
left a comment
There was a problem hiding this comment.
Forgot to post these comments
torchdata/dataloader2/dataloader2.py
Outdated
| return DataLoader2Iterator(self, self.valid_iterator_id) | ||
|
|
||
| def __len__(self): | ||
| return len(self.datapipe) |
There was a problem hiding this comment.
Does delegating to datapipe make sense? My only concern is that when __iter__ starts, ReadingService makes some changes to self.datapipe, and that can potentially change the __len__, though I think we should be fine to expect ReadingService not to? What about in distributed cases?
If RS will change the length, we might want to delegate to ReadingService instead?
There was a problem hiding this comment.
I think we still need to rely on self.datapipe because RS won't have aware of length of self.datapipe before __iter__ is invoked (as reading_service.initialize(datapipe)). And, after initialize(datapipe), the self.datapipe should have been modified and the corresponding changes should be reflected at length.
But, this behavior becomes super weird to me. The length varies before and after __iter__ is invoked.
There was a problem hiding this comment.
Could we do something like adapting self.datapipe when __len__ is called but self.datapipe has not been modified by self.datapipe?
There was a problem hiding this comment.
I am going to change it to len(self._datapipe_before_reading_service_adapt) because I am guessing that what you mean. It makes sense since that variable also shouldn't be impacted by ReadingService but it would be impacted by adapters within __init__, but that is a little bit dependent on the implementation of ReadingService. I think if you .apply_sharding within child process and not the main one, the length should not change.
Relevant issue about what is the proper __len__ to return: #533
|
Offline discussion:
|
|
Please re-request review when discussed changes applied. |
Adding `__len__` to `DataLoader2`. I think users would expect `DataLoader2` to have `len` available since the original one provides it. I also believe this is reasonable to provide. See inline comments. We should discuss the details and if this makes sense. Fixes #549 [ghstack-poisoned]
| self.assertEqual(10, len(data_loader_sharding)) | ||
|
|
||
| # Functional Test: Case with filter | ||
| data_loader_filter: DataLoader2 = DataLoader2(datapipe=filter_dp, reading_service=rs) |
There was a problem hiding this comment.
Since rs is being re-used, this would've failed without the previous PR that deepcopies.
| try: | ||
| self._length = len(datapipe) | ||
| except TypeError: | ||
| pass |
There was a problem hiding this comment.
Two things:
- This is taking the
lenbefore anyapply_shardinghappens. I think that is the right number that we want to return for multiprocessing. Agree? - This is only ran once (instead of per iteration) as I don't think users want different length per iteration. Is this a reasonable assumption?
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 [ghstack-poisoned]
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743) [ghstack-poisoned]
|
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
| try: | ||
| self._length = len(datapipe) | ||
| except TypeError: | ||
| self._length = None |
There was a problem hiding this comment.
Is this the right time/place to request length for distributed?
Adding `__len__` to `DataLoader2`. The main idea is to delegate to reading service to determine what the right length is; this will allow returning length only after the DataPipe is initialized (e.g. sharding is applied) and the correct length to be returned. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743) [ghstack-poisoned]
|
@NivekT Is there any update? I think it's very helpful for using Trainers like pytorch-lightning or huggingface Transformers. |
|
Hi @NivekT! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
Closing. Feel free to re-open if someone else would like to work on this. |
Stack from ghstack:
Adding
__len__toDataLoader2. The main idea is to delegate to reading service to determine what the right length is; this will allow returning length only after the DataPipe is initialized (e.g. sharding is applied) and the correct length to be returned.For
MultiprocessingReadingService, we will likely have methods to return one of these or both (undecided):nelements:{worker_id: int = worker_dp_length: int}n = num_workerstotal_len: intthat is the sum of all theworker_dp_lengthFor
DistributedReadingService, it gets more complicated and we will discuss that separately.The implementation currently doesn't reflect those ideas. This will be updated but currently is not a priority.
Fixes #549
Differential Revision: D38999743