Skip to content

[DataLoader2] Implementing single iterator constraint#700

Closed
NivekT wants to merge 5 commits intogh/NivekT/82/basefrom
gh/NivekT/82/head
Closed

[DataLoader2] Implementing single iterator constraint#700
NivekT wants to merge 5 commits intogh/NivekT/82/basefrom
gh/NivekT/82/head

Conversation

@NivekT
Copy link
Contributor

@NivekT NivekT commented Jul 29, 2022

Stack from ghstack:

Changes:

  • Imposing the single iterator constraint on DataLoader2 by having its __iter__ method returns a wrapper that can track if an iterator is valid or not
  • Previously, __iter__ returns self, but now it will return an instance of _DataLoader2Iterator
  • Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:

dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2

New behavior with this PR:

dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid

Differential Revision: D38262867

NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 7343eea
Pull Request resolved: #700
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 29, 2022
@NivekT NivekT requested a review from ejguan July 29, 2022 00:00
@NivekT
Copy link
Contributor Author

NivekT commented Jul 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if those comments make sense to you.

Comment on lines +130 to +133
"""
Delegate methods (e.g. `limit`, `pause`, `resume`, etc) to the iterator
created by the ReadingService and DataPipe.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we directly delegate to datapipe_iter from dataloader_iter, we don't need this function.

Copy link
Contributor Author

@NivekT NivekT Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with this change, but I would flag that methods such as state_dict and shutdown will not be delegated to DataLoader2. I am going to change the implementation to do that. As long as that is not an issue, we can delegate directly to dataloader_iter.

Copy link
Contributor Author

@NivekT NivekT Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we want to allow users to do dl2.limit(), dl2.resume(), and etc? Or we want them to always invoke those from the iterator?

If we want the former, we will need to keep this method.

Copy link
Contributor

@ejguan ejguan Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, users should only invoke such APIs on iterator object. I am fine either.

cc: @Miiira

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think invoking from one single place (i.e. iterator) rather than multiple places is better and less error-prone.

The internal test is calling .resume on the iterator, so I think it is fine to remove the API from the DataLoader2 class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we only need limit, resume on iterator. I'm good with removing this from DataLoader2. Maybe we also want to look at Lightning DataModule train_dataloader return type with this change

NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 5255e18
Pull Request resolved: #700
NivekT added a commit that referenced this pull request Jul 29, 2022
ghstack-source-id: 7d139b7
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Jul 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Changes:
* Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not
* Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator`
* Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2
```

New behavior with this PR:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid
```

Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 1, 2022
ghstack-source-id: b84c6c7
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Aug 1, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor Author

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR (one fix) and added a test to make through attributes/methods are being passed through. I will be looking at internal tests again shortly.

Comment on lines +24 to +37
class _ReadingServiceWrapper:
def __init__(self, dp):
self.dp = dp

def __iter__(self):
self.it = iter(self.dp)
return self

def __next__(self):
return next(self.it)

@staticmethod
def return_one():
return 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A method that dataloader._datapipe_iter will have

Changes:
* Imposing the single iterator constraint on `DataLoader2` by having its `__iter__` method returns a wrapper that can track if an iterator is valid or not
* Previously, `__iter__` returns `self`, but now it will return an instance of `_DataLoader2Iterator`
* Aside from the behavior related to reset (see below), we expect other behaviors to stay the same.

Prior behavior:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # No reset here
print(next(it2))  # 1
print(next(it1))  # 2
```

New behavior with this PR:
```python
dl = DataLoader2(IterableWrapper(range(10)))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
print(next(it2))  # 0
print(next(it1))  # Raises exception, since it1 is no longer valid
```

Differential Revision: [D38262867](https://our.internmc.facebook.com/intern/diff/D38262867)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 1, 2022
ghstack-source-id: 50e50b8
Pull Request resolved: #700
@NivekT
Copy link
Contributor Author

NivekT commented Aug 1, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants