Skip to content

[DataLoader] Add __len__ to DataLoader2#728

Closed
NivekT wants to merge 9 commits intogh/NivekT/87/basefrom
gh/NivekT/87/head
Closed

[DataLoader] Add __len__ to DataLoader2#728
NivekT wants to merge 9 commits intogh/NivekT/87/basefrom
gh/NivekT/87/head

Conversation

@NivekT
Copy link
Contributor

@NivekT NivekT commented Aug 11, 2022

Stack from ghstack:

Adding __len__ to DataLoader2. The main idea is to delegate to reading service to determine what the right length is; this will allow returning length only after the DataPipe is initialized (e.g. sharding is applied) and the correct length to be returned.

For MultiprocessingReadingService, we will likely have methods to return one of these or both (undecided):

  1. A Dict of n elements: {worker_id: int = worker_dp_length: int}
    • n = num_workers
    • This will allow users to decide and compute the custom length value that they desire.
  2. A total_len: int that is the sum of all the worker_dp_length
    • This make less sense if users choose to not shard (and duplicate the original data across workers)

For DistributedReadingService, it gets more complicated and we will discuss that separately.

The implementation currently doesn't reflect those ideas. This will be updated but currently is not a priority.

Fixes #549

Differential Revision: D38999743

NivekT added a commit that referenced this pull request Aug 11, 2022
ghstack-source-id: 13fe7e5
Pull Request resolved: #728
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 11, 2022
@NivekT NivekT changed the title [DataLoader2] Add __len__ to DataLoader2 [WIP][DataLoader2] Add __len__ to DataLoader2 Aug 11, 2022
Copy link
Contributor Author

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to post these comments

return DataLoader2Iterator(self, self.valid_iterator_id)

def __len__(self):
return len(self.datapipe)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does delegating to datapipe make sense? My only concern is that when __iter__ starts, ReadingService makes some changes to self.datapipe, and that can potentially change the __len__, though I think we should be fine to expect ReadingService not to? What about in distributed cases?

If RS will change the length, we might want to delegate to ReadingService instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to rely on self.datapipe because RS won't have aware of length of self.datapipe before __iter__ is invoked (as reading_service.initialize(datapipe)). And, after initialize(datapipe), the self.datapipe should have been modified and the corresponding changes should be reflected at length.

But, this behavior becomes super weird to me. The length varies before and after __iter__ is invoked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do something like adapting self.datapipe when __len__ is called but self.datapipe has not been modified by self.datapipe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to change it to len(self._datapipe_before_reading_service_adapt) because I am guessing that what you mean. It makes sense since that variable also shouldn't be impacted by ReadingService but it would be impacted by adapters within __init__, but that is a little bit dependent on the implementation of ReadingService. I think if you .apply_sharding within child process and not the main one, the length should not change.

Relevant issue about what is the proper __len__ to return: #533

@NivekT
Copy link
Contributor Author

NivekT commented Aug 12, 2022

Offline discussion:

  • Initialize when __len__ is called. First, check if RS is initialized, if not, initialize RS, then call self.ReadingService.get_datapipe_length(delegating to RS).
  • Need to change interface of RS to support these operations.

@VitalyFedyunin
Copy link
Contributor

Please re-request review when discussed changes applied.

Adding `__len__` to `DataLoader2`. I think users would expect `DataLoader2` to have `len` available since the original one provides it. I also believe this is reasonable to provide.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 18, 2022
ghstack-source-id: bff97f2
Pull Request resolved: #728
self.assertEqual(10, len(data_loader_sharding))

# Functional Test: Case with filter
data_loader_filter: DataLoader2 = DataLoader2(datapipe=filter_dp, reading_service=rs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since rs is being re-used, this would've failed without the previous PR that deepcopies.

Comment on lines +157 to +160
try:
self._length = len(datapipe)
except TypeError:
pass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. This is taking the len before any apply_sharding happens. I think that is the right number that we want to return for multiprocessing. Agree?
  2. This is only ran once (instead of per iteration) as I don't think users want different length per iteration. Is this a reasonable assumption?

@NivekT NivekT changed the title [WIP][DataLoader2] Add __len__ to DataLoader2 [DataLoader] Add __len__ to DataLoader2 Aug 18, 2022
@NivekT NivekT requested review from VitalyFedyunin and ejguan and removed request for VitalyFedyunin August 18, 2022 22:35
@NivekT NivekT added the topic: improvements topic category label Aug 18, 2022
Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 24, 2022
ghstack-source-id: 3707994
Pull Request resolved: #728
Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 24, 2022
ghstack-source-id: 4784fb9
Pull Request resolved: #728
@NivekT
Copy link
Contributor Author

NivekT commented Aug 24, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 6397743
Pull Request resolved: #728
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 1e0ec3c
Pull Request resolved: #728
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 18, 2022
ghstack-source-id: 066a69e
Pull Request resolved: #728
@NivekT
Copy link
Contributor Author

NivekT commented Oct 18, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Adding `__len__` to `DataLoader2`.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 18, 2022
ghstack-source-id: aff9b4d
Pull Request resolved: #728
@NivekT
Copy link
Contributor Author

NivekT commented Oct 18, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +393 to +396
try:
self._length = len(datapipe)
except TypeError:
self._length = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right time/place to request length for distributed?

Adding `__len__` to `DataLoader2`. The main idea is to delegate to reading service to determine what the right length is; this will allow returning length only after the DataPipe is initialized (e.g. sharding is applied) and the correct length to be returned.

See inline comments. We should discuss the details and if this makes sense.

Fixes #549

Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: 3ecf741
Pull Request resolved: #728
@NivekT NivekT marked this pull request as draft February 23, 2023 01:15
@NivekT NivekT removed the request for review from VitalyFedyunin February 23, 2023 01:16
@SeanLiaoy
Copy link

SeanLiaoy commented Apr 21, 2023

@NivekT Is there any update? I think it's very helpful for using Trainers like pytorch-lightning or huggingface Transformers.

@facebook-github-bot
Copy link
Contributor

Hi @NivekT!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@NivekT
Copy link
Contributor Author

NivekT commented Jun 11, 2023

Closing. Feel free to re-open if someone else would like to work on this.

@NivekT NivekT closed this Jun 11, 2023
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/87/head branch July 18, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants