[DataPipe] Align shuffling behavior for IterDataPipe and MapDataPipe#82974
[DataPipe] Align shuffling behavior for IterDataPipe and MapDataPipe#82974ejguan wants to merge 5 commits intogh/ejguan/41/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful links
✅ No Failures (0 Pending)As of commit 64eedd0 (more details on the Dr. CI page): Expand to see more💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
…apDataPipe" [ghstack-poisoned]
…apDataPipe" This PR align the shuffling behavior of two types of `DataPipe`: - Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked. - [Discussion] `seed` will be fixed unless `set_seed` is invoked. - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked. - Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe` [ghstack-poisoned]
NivekT
left a comment
There was a problem hiding this comment.
Overall, I agree that we should align the shuffle behaviors between two styles of DataPipe.
Letting DataLoader control this behavior makes sense. At the expense of UX of people who don't use DataPipes with DataLoader for whatever reason.
- I do wonder if we should allow users to toggle this behavior (same or different seed with new iterator) on and off.
- We need to update all the docstring and documentation to make sure that the behavior is clear to users.
| len(dp1) | ||
|
|
||
| def test_shuffle_iterdatapipe(self): | ||
| def test_shuffler_iterdatapipe(self): |
There was a problem hiding this comment.
Nit: Maybe add a test in the # No seed section to ensure that that order of the output is the same in two epoch even without manually resetting the seed?
Do we have tests that use the set_seed API? Maybe the current integrated test with DataLoader is sufficient since DL uses set_seed.
There was a problem hiding this comment.
Let me re-construct the whole tests for shuffle.
|
Thinking for a while on And, after making sure both |
…apDataPipe" This PR align the shuffling behavior of two types of `DataPipe`: - Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked. - [Discussion] `seed` will be fixed unless `set_seed` is invoked. - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked. - Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe` [ghstack-poisoned]
…apDataPipe" This PR align the shuffling behavior of two types of `DataPipe`: - Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked. - [Discussion] `seed` will be fixed unless `set_seed` is invoked. - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked. - Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe` [ghstack-poisoned]
|
Closing this PR as we all agreed on moving |
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices. [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` [ghstack-poisoned]
…IterDataPipe" Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` [ghstack-poisoned]
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT
Summary: Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/a423c966a780a1fdac6a29c6d2be2a0709de2cd5 Reviewed By: weiwangmeta Differential Revision: D39084833 Pulled By: ejguan fbshipit-source-id: e873f566f289b512e4f9f566149db0b9c7e5a89e
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT
Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT
Summary: Fixes: meta-pytorch/data#718 This is an alternative PR against #82974 This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle` - Lazily generating seed per iteration - Each iterators has a new seed - Convert `MapDataPipe.shuffle` to an `IterDataPipe` ## BC-breaking Note: This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`. ### 1. 12 Output as `MapDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) True >>> isinstance(dp, IterDataPipe) False ``` ### This PR: Output as `IterDataPipe` ``` >>> from torch.utils.data import IterDataPipe, MapDataPipe >>> from torch.utils.data.datapipes.map import SequenceWrapper >>> dp = SequenceWrapper(list(range(10))).shuffle() >>> isinstance(dp, MapDataPipe) False >>> isinstance(dp, IterDataPipe) True ``` Pull Request resolved: #83202 Approved by: https://github.com/NivekT Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3f947264533f318355b848c070a4279032cbb5d8 Reviewed By: weiwangmeta Differential Revision: D39099339 Pulled By: ejguan fbshipit-source-id: e8a77d5ec05287e38b5cde1d81035d236cb63561
Stack from ghstack:
This PR align the shuffling behavior of two types of
DataPipe:__iter__or__getitem__is invoked.seedwill be fixed unlessset_seedis invoked.IterDataPipeas the original behavior would a seed will be generated wheneveriteris invoked, but now eachiteratorwould share the same seed untilset_seedis invoked.test_shuffler_iterdatapipeandtest_shuffler_mapdatapipe