[DataPipe] Add RandomSplitter (without buffer) by NivekT · Pull Request #724 · meta-pytorch/data

NivekT · 2022-08-09T23:16:50Z

Stack from ghstack:

-> [DataPipe] Add RandomSplitter (without buffer) #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:

I decided against reusing _ChildDataPipe since its features are overly complicated for this use case.
I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for test and the second iteration is for valid. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: D38675266

[ghstack-poisoned]

ghstack-source-id: b76a00f Pull Request resolved: #724

torchdata/datapipes/iter/util/randomsplitter.py

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). TODO: * Decide if we like this or the buffer version better. Or we can add both. * Determines if the API related to randomness needs further extension (we might need to add set_seed) * More tests. See #712 for related discussion. See #723 for the version with buffer. [ghstack-poisoned]

ghstack-source-id: 77c39c7 Pull Request resolved: #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). TODO: * Decide if we like this or the buffer version better. Or we can add both. * Determines if the API related to randomness needs further extension (we might need to add set_seed) * More tests. Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. See #712 for related discussion. See #723 for the version with buffer. [ghstack-poisoned]

ghstack-source-id: 256d846 Pull Request resolved: #724

NivekT · 2022-08-12T19:57:39Z

Offline: Discussion:

This buffer-less version is likely better but we need more clear error message.
Let's support both syntax - if "target" is provided, then return only one DataPipe. Otherwise, returns a list of DataPipes. Look at the first commit.
We definitely want set_seed to allow changing of seed.
The default behavior should be same seed every epoch. We can have an argument to allow automatically changing of seed between epochs.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). TODO: * Decide if we like this or the buffer version better. Or we can add both. * Determines if the API related to randomness needs further extension (we might need to add set_seed) * More tests. Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. See #712 for related discussion. See #723 for the version with buffer. [ghstack-poisoned]

ghstack-source-id: 6679495 Pull Request resolved: #724

NivekT

Updated this PR based on our discussion.

Note: I decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for test and the second iteration is for valid. Changing seed will be confusing and causes inconsistency. Users should use set_seed instead to update the seed when necessary.

NivekT · 2022-08-12T22:11:28Z

tools/gen_pyi.py

        "dataframe": "torcharrow.DataFrame",
        "end_caching": "IterDataPipe",
        "unzip": "List[IterDataPipe]",
+        "random_split": "Union[IterDataPipe, List[IterDataPipe]]",


Is this return type fine? It will either be a single IterDataPipe or a List[...] depending if target is specified.

torchdata/datapipes/iter/util/randomsplitter.py

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. [ghstack-poisoned]

ghstack-source-id: 1dad11b Pull Request resolved: #724

NivekT · 2022-08-12T22:21:18Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 67a36f9 Pull Request resolved: #724

NivekT · 2022-08-12T22:48:12Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 0109366 Pull Request resolved: #724

NivekT · 2022-08-12T22:57:57Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin · 2022-08-15T15:44:41Z

Can we derive total_length from source Datapipe if possible?

ghstack-source-id: d22ce0d Pull Request resolved: #724

torchdata/datapipes/iter/util/randomsplitter.py

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 21a5c13 Pull Request resolved: #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: e8328e4 Pull Request resolved: #724

NivekT

I think this should be mostly consistent with Shuffler with the exception that the seed doesn't change per epoch.

ejguan

Overall, LGTM with a few comments

torchdata/datapipes/iter/util/randomsplitter.py

ejguan · 2022-08-25T14:41:45Z

torchdata/datapipes/iter/util/randomsplitter.py

+    @staticmethod
+    def normalize_weights(weights, total_length: int):
+        total_weight = sum(weights.values())
+        return {k: round(float(w) * total_length / total_weight) for k, w in weights.items()}


And, using round might lead to the sum of weights not equal to total_length.

You are right. This also means we have to renormalize the weights after one of the keys run out. I have updated my implementation. Please verify.

I am unsure. If one key runs out, why do we need to renormalize? If one weight goes to 0, the corresponding key should never be returned by random.choices IMHO.

Let's say we have total_length=10 and weights=[3.33, 3.33, 3.33].
We draw 4 '0's in a row, the remaining weights will be weights=[-0.67, 3.33, 3.33].
We can set the negative value to 0 and get weights=[0, 3.33, 3.33]. Now the remaining number of draws is 6 but the weights sum up to 6.66.

It should not impact the result/output, but if we choose to normalize, it will bring the sum back to 6 to match the remaining number of draws. I can take it out since it is technically unnecessary computation.

I think I will take it out, because it will greatly simplify the other functions. For example, normalize_weights will no longer have to consider the list case.

Then, how about we always convert weights to a list of integer. For example, we do floor over all the elements except the last one. And, the rest of elements send to the last key. Then, we don't need to normalize and the weight is kept as integer then no negative value will happen.

Two things:

I was wrong about not needing to normalize after a weight hitting negative. Actually it is necessary.
Example: Splitting 10 elements to 9 DPs: [1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11]
Without normalization, you may end up with 2 or more DPs with more than one element, while the ideal case is 8
DPs with 1 element and 1 DP with 2.

I considered your suggestion to floor all but the last one. It changes the weights from its true distributino.
Example: 10 elements split into two DPs with weights [0.89, 0.11], most of the time you want 9 element in the first
and 1 element second DP.

Because of these complications, I am going to add a note, telling users that if they want certainty, please provide integer weights that sum up to total_length.

IIUC, the current implementation is similar to doing ceiling to all keys until the last requested. That's why we have to do re-normalization when each time one key is depleted. It's fine but I feel like we can do it in advance without doing normalization multiple times.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 65a2318 Pull Request resolved: #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 3d02cf5 Pull Request resolved: #724

NivekT · 2022-08-25T19:34:53Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

torchdata/datapipes/iter/util/randomsplitter.py

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 7e7622b Pull Request resolved: #724

NivekT · 2022-08-25T20:23:17Z

Thanks for the helpful comments. It is simpler than before now!

NivekT · 2022-08-25T20:26:00Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ejguan

Except the last comment, overall lgtm. Thank you.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 5fc7d47 Pull Request resolved: #724

NivekT · 2022-08-25T21:56:22Z

torchdata/datapipes/iter/util/randomsplitter.py

+
+    def get_length(self, target: T) -> int:
+        if not self._lengths:
+            if all(w.is_integer() for w in self.norm_weights) and sum(self.norm_weights) == self.total_length:


We can potentially assert this during __init__, such that if the normalized weights aren't integer or cannot sum up to total_length, we will raise exception.

NivekT · 2022-08-29T16:18:49Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful). Implementation note: * I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case. * I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency. See #712 for related discussion. See #723 for the version with buffer. Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266) [ghstack-poisoned]

ghstack-source-id: 9e77b6d Pull Request resolved: #724

NivekT · 2022-08-29T19:16:56Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

[DataPipe] Add RandomSplitter (without buffer)

d4eeb84

[ghstack-poisoned]

NivekT added a commit that referenced this pull request Aug 9, 2022

[DataPipe] Add RandomSplitter (without buffer)

c63e578

ghstack-source-id: b76a00f Pull Request resolved: #724

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 9, 2022

NivekT marked this pull request as draft August 9, 2022 23:17

NivekT changed the title ~~[DataPipe] Add RandomSplitter (without buffer)~~ [WIP][DataPipe] Add RandomSplitter (without buffer) Aug 9, 2022

NivekT mentioned this pull request Aug 9, 2022

[WIP][DataPipe] Add RandomSplitter (with buffer) #723

Closed

NivekT requested review from VitalyFedyunin and ejguan August 9, 2022 23:20

ejguan reviewed Aug 10, 2022

View reviewed changes

torchdata/datapipes/iter/util/randomsplitter.py Outdated Show resolved Hide resolved

NivekT added a commit that referenced this pull request Aug 10, 2022

[DataPipe] Add RandomSplitter (without buffer)

d69f006

ghstack-source-id: 77c39c7 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 10, 2022

[DataPipe] Add RandomSplitter (without buffer)

81799bf

ghstack-source-id: 256d846 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 12, 2022

[DataPipe] Add RandomSplitter (without buffer)

022c067

ghstack-source-id: 6679495 Pull Request resolved: #724

NivekT changed the title ~~[WIP][DataPipe] Add RandomSplitter (without buffer)~~ [DataPipe] Add RandomSplitter (without buffer) Aug 12, 2022

NivekT commented Aug 12, 2022

View reviewed changes

NivekT marked this pull request as ready for review August 12, 2022 22:12

NivekT added a commit that referenced this pull request Aug 12, 2022

[DataPipe] Add RandomSplitter (without buffer)

30c0ff3

ghstack-source-id: 1dad11b Pull Request resolved: #724

NivekT added the topic: new feature topic category label Aug 12, 2022

NivekT added a commit that referenced this pull request Aug 12, 2022

[DataPipe] Add RandomSplitter (without buffer)

9e73a35

ghstack-source-id: 67a36f9 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 12, 2022

[DataPipe] Add RandomSplitter (without buffer)

87fa4f9

ghstack-source-id: 0109366 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 17, 2022

[DataPipe] Add RandomSplitter (without buffer)

b73ddf7

ghstack-source-id: d22ce0d Pull Request resolved: #724

ejguan reviewed Aug 17, 2022

View reviewed changes

torchdata/datapipes/iter/util/randomsplitter.py Outdated Show resolved Hide resolved

torchdata/datapipes/iter/util/randomsplitter.py Show resolved Hide resolved

NivekT added a commit that referenced this pull request Aug 17, 2022

[DataPipe] Add RandomSplitter (without buffer)

8d5147a

ghstack-source-id: 21a5c13 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 17, 2022

[DataPipe] Add RandomSplitter (without buffer)

a05f981

ghstack-source-id: e8328e4 Pull Request resolved: #724

NivekT commented Aug 25, 2022

View reviewed changes

NivekT requested a review from ejguan August 25, 2022 00:21

ejguan reviewed Aug 25, 2022

View reviewed changes

NivekT added a commit that referenced this pull request Aug 25, 2022

[DataPipe] Add RandomSplitter (without buffer)

ae856de

ghstack-source-id: 65a2318 Pull Request resolved: #724

NivekT added a commit that referenced this pull request Aug 25, 2022

[DataPipe] Add RandomSplitter (without buffer)

a6540af

ghstack-source-id: 3d02cf5 Pull Request resolved: #724

ejguan reviewed Aug 25, 2022

View reviewed changes

torchdata/datapipes/iter/util/randomsplitter.py Outdated Show resolved Hide resolved

NivekT added a commit that referenced this pull request Aug 25, 2022

[DataPipe] Add RandomSplitter (without buffer)

051e0e5

ghstack-source-id: 7e7622b Pull Request resolved: #724

ejguan approved these changes Aug 25, 2022

View reviewed changes

NivekT added a commit that referenced this pull request Aug 25, 2022

[DataPipe] Add RandomSplitter (without buffer)

f709e1f

ghstack-source-id: 5fc7d47 Pull Request resolved: #724

NivekT commented Aug 25, 2022

View reviewed changes

NivekT added a commit that referenced this pull request Aug 29, 2022

[DataPipe] Add RandomSplitter (without buffer)

569e351

ghstack-source-id: 9e77b6d Pull Request resolved: #724

facebook-github-bot closed this in a46e53c Aug 29, 2022

facebook-github-bot deleted the gh/NivekT/86/head branch September 2, 2022 14:21

Conversation

NivekT commented Aug 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

NivekT commented Aug 12, 2022

Uh oh!

NivekT left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NivekT commented Aug 12, 2022

Uh oh!

NivekT commented Aug 12, 2022

Uh oh!

NivekT commented Aug 12, 2022

Uh oh!

VitalyFedyunin commented Aug 15, 2022

Uh oh!

Uh oh!

Uh oh!

NivekT left a comment

Choose a reason for hiding this comment

Uh oh!

ejguan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NivekT commented Aug 25, 2022

Uh oh!

Uh oh!

NivekT commented Aug 25, 2022

Uh oh!

NivekT commented Aug 25, 2022

Uh oh!

ejguan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NivekT commented Aug 29, 2022

Uh oh!

NivekT commented Aug 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

NivekT commented Aug 9, 2022 •

edited

Loading