Skip to content

[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe#83202

Closed
ejguan wants to merge 10 commits intogh/ejguan/43/basefrom
gh/ejguan/43/head
Closed

[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe#83202
ejguan wants to merge 10 commits intogh/ejguan/43/basefrom
gh/ejguan/43/head

Conversation

@ejguan
Copy link
Contributor

@ejguan ejguan commented Aug 10, 2022

Fixes: meta-pytorch/data#718

Stack from ghstack:

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as IterDataPipe.shuffle

  • Lazily generating seed per iteration
  • Each iterators has a new seed
  • Convert MapDataPipe.shuffle to an IterDataPipe

BC-breaking Note:

This PR changes the return type of MapDataPipe.shuffle from a MapDataPipe to a IterDataPipe.

1. 12

Output as MapDataPipe

>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False

This PR:

Output as IterDataPipe

>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True

@diff-train-skip-merge

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 10, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit b8111b3 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

ejguan added a commit that referenced this pull request Aug 10, 2022
ghstack-source-id: ba52215
Pull Request resolved: #83202
ejguan added a commit that referenced this pull request Aug 10, 2022
ghstack-source-id: 6c7e273
Pull Request resolved: #83202
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small changes are needed for interface generation.

Please update the MapDataPipe version's docstring so that users know it will be converted to IterDataPipe as well.

Question: Will the duplicate name ShufflerIterDataPipe cause any problem? I can't think of any since they are separated into different directories. I think it should be fine.

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added 3 commits August 16, 2022 21:04
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! We have the questions as the last one; will approve as soon as we finalize the details on the last one and modify this one the similarly.

ejguan added 2 commits August 23, 2022 20:29
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation LGTM! Just need to update the docstring. Some ideas:

  1. Mention that IterDataPipe will now be returned (and possibly why and use converter to get MapDataPipe back again) - you can decide what to mention
  2. In the example, call list(shuffle_dp) once more to show that the result will be different in the second call
  3. We can potentially mention the set_seed API but if we expect users to rely on DataLoader then maybe not

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

[ghstack-poisoned]
@ejguan
Copy link
Contributor Author

ejguan commented Aug 26, 2022

@pytorchbot merge -g

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

facebook-github-bot pushed a commit that referenced this pull request Aug 28, 2022
Summary:
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: #83202
Approved by: https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/a423c966a780a1fdac6a29c6d2be2a0709de2cd5

Reviewed By: weiwangmeta

Differential Revision: D39084833

Pulled By: ejguan

fbshipit-source-id: e873f566f289b512e4f9f566149db0b9c7e5a89e
@facebook-github-bot
Copy link
Contributor

@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).)

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot
Copy link
Collaborator

@ejguan your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 28, 2022
@weiwangmeta
Copy link
Contributor

@pytorchbot merge -l

@weiwangmeta weiwangmeta reopened this Aug 29, 2022
@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here and land check progress here.
The merge job was triggered with the land checks (-l) flag. If you did not specify this flag yourself, you are likely enrolled in the land checks rollout. This means that your change will be merged once all checks on your PR and the land checks have passed (ETA 4 Hours). If you need to coordinate lands between different changes and cannot risk a land race, please add the ciflow/trunk label to your PR and wait for signal to complete, and then land your changes in proper order. Having trunk, pull, and Lint pre-run on a PR will bypass land checks and the ETA should be immediate. If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2022
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: #83202
Approved by: https://github.com/NivekT
facebook-github-bot pushed a commit that referenced this pull request Aug 29, 2022
…ipe (#83202)

Test Plan: revert-hammer

Differential Revision:
D39084833

Original commit changeset: e873f566f289

Original Phabricator Diff: D39084833

fbshipit-source-id: 12a9f85310d140e87ca23472ec6c08c33625f408
facebook-github-bot pushed a commit that referenced this pull request Aug 29, 2022
Summary:
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: #83202
Approved by: https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3f947264533f318355b848c070a4279032cbb5d8

Reviewed By: weiwangmeta

Differential Revision: D39099339

Pulled By: ejguan

fbshipit-source-id: e8a77d5ec05287e38b5cde1d81035d236cb63561
@facebook-github-bot facebook-github-bot deleted the gh/ejguan/43/head branch September 1, 2022 14:20
pytorchmergebot pushed a commit that referenced this pull request Sep 13, 2022
This PR requires PR is landed: #83202

## changes
- For `apply_shuffle_setting` and `apply_shuffle_seed`, it makes sure it will apply shuffle setting to each of DataPipe that contains a method called `set_shuffle` or `set_seed`.
- Change the API from `apply_shuffle_seed` to `apply_random_seed`.
- Fix a bug that `apply_shuffle_seed` only accepts DataPipe that is hashable. After the PR, this function uses `id` to prevent seeding the same DataPipe multiple times per epoch.
- Fix another bug from `shuffler` that `reset` with `_enable=False` would also reset `_seed`.
Pull Request resolved: #83741
Approved by: https://github.com/NivekT
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
This PR requires PR is landed: #83202

## changes
- For `apply_shuffle_setting` and `apply_shuffle_seed`, it makes sure it will apply shuffle setting to each of DataPipe that contains a method called `set_shuffle` or `set_seed`.
- Change the API from `apply_shuffle_seed` to `apply_random_seed`.
- Fix a bug that `apply_shuffle_seed` only accepts DataPipe that is hashable. After the PR, this function uses `id` to prevent seeding the same DataPipe multiple times per epoch.
- Fix another bug from `shuffler` that `reset` with `_enable=False` would also reset `_seed`.
Pull Request resolved: #83741
Approved by: https://github.com/NivekT
Rick0317 pushed a commit to Rick0317/pytorch that referenced this pull request Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants