[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe#83202
[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe#83202ejguan wants to merge 10 commits intogh/ejguan/43/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful links
✅ No Failures (0 Pending)As of commit b8111b3 (more details on the Dr. CI page): Expand to see more💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
[ghstack-poisoned]
[ghstack-poisoned]
NivekT
left a comment
There was a problem hiding this comment.
Small changes are needed for interface generation.
Please update the MapDataPipe version's docstring so that users know it will be converted to IterDataPipe as well.
Question: Will the duplicate name ShufflerIterDataPipe cause any problem? I can't think of any since they are separated into different directories. I think it should be fine.
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
NivekT
left a comment
There was a problem hiding this comment.
Overall LGTM! We have the questions as the last one; will approve as soon as we finalize the details on the last one and modify this one the similarly.
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
NivekT
left a comment
There was a problem hiding this comment.
The implementation LGTM! Just need to update the docstring. Some ideas:
- Mention that
IterDataPipewill now be returned (and possibly why and use converter to getMapDataPipeback again) - you can decide what to mention - In the example, call
list(shuffle_dp)once more to show that the result will be different in the second call - We can potentially mention the
set_seedAPI but if we expect users to rely onDataLoaderthen maybe not
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` [ghstack-poisoned]
|
@pytorchbot merge -g |
|
@pytorchbot successfully started a merge job. Check the current status here. |
Summary: Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/a423c966a780a1fdac6a29c6d2be2a0709de2cd5 Reviewed By: weiwangmeta Differential Revision: D39084833 Pulled By: ejguan fbshipit-source-id: e873f566f289b512e4f9f566149db0b9c7e5a89e
|
@pytorchbot revert -m="Diff reverted internally" -c="ghfirst" This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).) |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@ejguan your PR has been successfully reverted. |
This reverts commit a423c96. Reverted #83202 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
|
@pytorchbot merge -l |
|
@pytorchbot successfully started a merge job. Check the current status here and land check progress here. |
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT
…ipe (#83202) Test Plan: revert-hammer Differential Revision: D39084833 Original commit changeset: e873f566f289 Original Phabricator Diff: D39084833 fbshipit-source-id: 12a9f85310d140e87ca23472ec6c08c33625f408
Summary: Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3f947264533f318355b848c070a4279032cbb5d8 Reviewed By: weiwangmeta Differential Revision: D39099339 Pulled By: ejguan fbshipit-source-id: e8a77d5ec05287e38b5cde1d81035d236cb63561
This PR requires PR is landed: #83202 ## changes - For `apply_shuffle_setting` and `apply_shuffle_seed`, it makes sure it will apply shuffle setting to each of DataPipe that contains a method called `set_shuffle` or `set_seed`. - Change the API from `apply_shuffle_seed` to `apply_random_seed`. - Fix a bug that `apply_shuffle_seed` only accepts DataPipe that is hashable. After the PR, this function uses `id` to prevent seeding the same DataPipe multiple times per epoch. - Fix another bug from `shuffler` that `reset` with `_enable=False` would also reset `_seed`. Pull Request resolved: #83741 Approved by: https://github.com/NivekT
This PR requires PR is landed: #83202 ## changes - For `apply_shuffle_setting` and `apply_shuffle_seed`, it makes sure it will apply shuffle setting to each of DataPipe that contains a method called `set_shuffle` or `set_seed`. - Change the API from `apply_shuffle_seed` to `apply_random_seed`. - Fix a bug that `apply_shuffle_seed` only accepts DataPipe that is hashable. After the PR, this function uses `id` to prevent seeding the same DataPipe multiple times per epoch. - Fix another bug from `shuffler` that `reset` with `_enable=False` would also reset `_seed`. Pull Request resolved: #83741 Approved by: https://github.com/NivekT
…ization ghstack-source-id: 7b310d0 Pull Request resolved: pytorch/pytorch#83202
Fixes: meta-pytorch/data#718
Stack from ghstack:
This is an alternative PR against #82974
This PR would change the behavior for both types to the same behavior as
IterDataPipe.shuffleMapDataPipe.shuffleto anIterDataPipeBC-breaking Note:
This PR changes the return type of
MapDataPipe.shufflefrom aMapDataPipeto aIterDataPipe.1. 12
Output as
MapDataPipeThis PR:
Output as
IterDataPipe@diff-train-skip-merge