Skip to content

[DataPipe] Align shuffling behavior for IterDataPipe and MapDataPipe#82974

Closed
ejguan wants to merge 5 commits intogh/ejguan/41/basefrom
gh/ejguan/41/head
Closed

[DataPipe] Align shuffling behavior for IterDataPipe and MapDataPipe#82974
ejguan wants to merge 5 commits intogh/ejguan/41/basefrom
gh/ejguan/41/head

Conversation

@ejguan
Copy link
Contributor

@ejguan ejguan commented Aug 8, 2022

Stack from ghstack:

This PR align the shuffling behavior of two types of DataPipe:

  • Shuffling will be lazily executed only after __iter__ or __getitem__ is invoked.
  • [Discussion] seed will be fixed unless set_seed is invoked.
    • BC breaking for IterDataPipe as the original behavior would a seed will be generated whenever iter is invoked, but now each iterator would share the same seed until set_seed is invoked.
  • Tests exist as test_shuffler_iterdatapipe and test_shuffler_mapdatapipe

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 8, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 64eedd0 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

ejguan added 2 commits August 8, 2022 19:20
…apDataPipe"


This PR align the shuffling behavior of two types of `DataPipe`:
- Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked.
- [Discussion] `seed` will be fixed unless `set_seed` is invoked.
  - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked.
- Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe`

[ghstack-poisoned]
@ejguan ejguan requested review from NivekT and VitalyFedyunin August 8, 2022 21:31
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I agree that we should align the shuffle behaviors between two styles of DataPipe.

Letting DataLoader control this behavior makes sense. At the expense of UX of people who don't use DataPipes with DataLoader for whatever reason.

  1. I do wonder if we should allow users to toggle this behavior (same or different seed with new iterator) on and off.
  2. We need to update all the docstring and documentation to make sure that the behavior is clear to users.

len(dp1)

def test_shuffle_iterdatapipe(self):
def test_shuffler_iterdatapipe(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe add a test in the # No seed section to ensure that that order of the output is the same in two epoch even without manually resetting the seed?

Do we have tests that use the set_seed API? Maybe the current integrated test with DataLoader is sufficient since DL uses set_seed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me re-construct the whole tests for shuffle.

@ejguan
Copy link
Contributor Author

ejguan commented Aug 10, 2022

Thinking for a while on MapDataPipe.shuffle today, I find it becomes anti-intuitive to me that Shuffler is a still a MapDataPipe. We normally refer MapDataPipe as we provide an index to this instance, it gives a corresponding object. However, shuffling would make it so weird that a non-corresponding object is returned given an index. I personally think it makes more sense that MapDataPipe should be converted to an IterDataPipe after shuffling.

And, after making sure both shufflers are IterDataPipe, I am able to align the behavior of them to always update new shuffled order per iteration. I will open a separate PR for it #83202

cc: @NivekT @VitalyFedyunin

ejguan added 2 commits August 10, 2022 20:52
…apDataPipe"


This PR align the shuffling behavior of two types of `DataPipe`:
- Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked.
- [Discussion] `seed` will be fixed unless `set_seed` is invoked.
  - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked.
- Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe`

[ghstack-poisoned]
…apDataPipe"


This PR align the shuffling behavior of two types of `DataPipe`:
- Shuffling will be lazily executed only after `__iter__` or `__getitem__` is invoked.
- [Discussion] `seed` will be fixed unless `set_seed` is invoked.
  - BC breaking for `IterDataPipe` as the original behavior would a seed will be generated whenever `iter` is invoked, but now each `iterator` would share the same seed until `set_seed` is invoked.
- Tests exist as `test_shuffler_iterdatapipe` and `test_shuffler_mapdatapipe`

[ghstack-poisoned]
@ejguan
Copy link
Contributor Author

ejguan commented Aug 15, 2022

Closing this PR as we all agreed on moving MapDataPipe.shuffle to an IterDataPipe during an offline discussion.

@ejguan ejguan closed this Aug 15, 2022
ejguan added a commit that referenced this pull request Aug 16, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 16, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 16, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 16, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 17, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 17, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
However, this requires `MapDataPipe.shuffle` is converted as a `IterDataPipe`. I prefer this option simply because I find the concept of shuffling `MapDataPipe` would be shuffling indices and returning the data in the order of shuffled indices.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 18, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 18, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 23, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 23, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 24, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 24, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 26, 2022
…IterDataPipe"

Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Aug 26, 2022
Fixes: meta-pytorch/data#718



This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Aug 26, 2022
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: #83202
Approved by: https://github.com/NivekT
facebook-github-bot pushed a commit that referenced this pull request Aug 28, 2022
Summary:
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: #83202
Approved by: https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/a423c966a780a1fdac6a29c6d2be2a0709de2cd5

Reviewed By: weiwangmeta

Differential Revision: D39084833

Pulled By: ejguan

fbshipit-source-id: e873f566f289b512e4f9f566149db0b9c7e5a89e
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2022
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: #83202
Approved by: https://github.com/NivekT
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2022
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: #83202
Approved by: https://github.com/NivekT
facebook-github-bot pushed a commit that referenced this pull request Aug 29, 2022
Summary:
Fixes: meta-pytorch/data#718

This is an alternative PR against #82974

This PR would change the behavior for both types to the same behavior as `IterDataPipe.shuffle`
- Lazily generating seed per iteration
- Each iterators has a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to a `IterDataPipe`.

### 1. 12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```

Pull Request resolved: #83202
Approved by: https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3f947264533f318355b848c070a4279032cbb5d8

Reviewed By: weiwangmeta

Differential Revision: D39099339

Pulled By: ejguan

fbshipit-source-id: e8a77d5ec05287e38b5cde1d81035d236cb63561
@facebook-github-bot facebook-github-bot deleted the gh/ejguan/41/head branch September 15, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants