Skip to content

[DataPipe] Add RandomSplitter (without buffer)#724

Closed
NivekT wants to merge 18 commits intogh/NivekT/86/basefrom
gh/NivekT/86/head
Closed

[DataPipe] Add RandomSplitter (without buffer)#724
NivekT wants to merge 18 commits intogh/NivekT/86/basefrom
gh/NivekT/86/head

Conversation

@NivekT
Copy link
Contributor

@NivekT NivekT commented Aug 9, 2022

Stack from ghstack:

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:

  • I decided against reusing _ChildDataPipe since its features are overly complicated for this use case.
  • I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for test and the second iteration is for valid. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: D38675266

NivekT added a commit that referenced this pull request Aug 9, 2022
ghstack-source-id: b76a00f
Pull Request resolved: #724
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 9, 2022
@NivekT NivekT marked this pull request as draft August 9, 2022 23:17
@NivekT NivekT changed the title [DataPipe] Add RandomSplitter (without buffer) [WIP][DataPipe] Add RandomSplitter (without buffer) Aug 9, 2022
@NivekT NivekT requested review from VitalyFedyunin and ejguan August 9, 2022 23:20
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

TODO:
* Decide if we like this or the buffer version better. Or we can add both.
* Determines if the API related to randomness needs further extension (we might need to add set_seed)
* More tests.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 10, 2022
ghstack-source-id: 77c39c7
Pull Request resolved: #724
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

TODO:
* Decide if we like this or the buffer version better. Or we can add both.
* Determines if the API related to randomness needs further extension (we might need to add set_seed)
* More tests.

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 10, 2022
ghstack-source-id: 256d846
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 12, 2022

Offline: Discussion:

  • This buffer-less version is likely better but we need more clear error message.
  • Let's support both syntax - if "target" is provided, then return only one DataPipe. Otherwise, returns a list of DataPipes. Look at the first commit.
  • We definitely want set_seed to allow changing of seed.
  • The default behavior should be same seed every epoch. We can have an argument to allow automatically changing of seed between epochs.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

TODO:
* Decide if we like this or the buffer version better. Or we can add both.
* Determines if the API related to randomness needs further extension (we might need to add set_seed)
* More tests.

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
ghstack-source-id: 6679495
Pull Request resolved: #724
@NivekT NivekT changed the title [WIP][DataPipe] Add RandomSplitter (without buffer) [DataPipe] Add RandomSplitter (without buffer) Aug 12, 2022
Copy link
Contributor Author

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this PR based on our discussion.

Note: I decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for test and the second iteration is for valid. Changing seed will be confusing and causes inconsistency. Users should use set_seed instead to update the seed when necessary.

"dataframe": "torcharrow.DataFrame",
"end_caching": "IterDataPipe",
"unzip": "List[IterDataPipe]",
"random_split": "Union[IterDataPipe, List[IterDataPipe]]",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this return type fine? It will either be a single IterDataPipe or a List[...] depending if target is specified.

@NivekT NivekT marked this pull request as ready for review August 12, 2022 22:12
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
ghstack-source-id: 1dad11b
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 12, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@NivekT NivekT added the topic: new feature topic category label Aug 12, 2022
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
ghstack-source-id: 67a36f9
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 12, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
ghstack-source-id: 0109366
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 12, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Copy link
Contributor

Can we derive total_length from source Datapipe if possible?

NivekT added a commit that referenced this pull request Aug 17, 2022
ghstack-source-id: d22ce0d
Pull Request resolved: #724
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 17, 2022
ghstack-source-id: 21a5c13
Pull Request resolved: #724
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 17, 2022
ghstack-source-id: e8328e4
Pull Request resolved: #724
Copy link
Contributor Author

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be mostly consistent with Shuffler with the exception that the seed doesn't change per epoch.

@NivekT NivekT requested a review from ejguan August 25, 2022 00:21
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM with a few comments

@staticmethod
def normalize_weights(weights, total_length: int):
total_weight = sum(weights.values())
return {k: round(float(w) * total_length / total_weight) for k, w in weights.items()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, using round might lead to the sum of weights not equal to total_length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This also means we have to renormalize the weights after one of the keys run out. I have updated my implementation. Please verify.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure. If one key runs out, why do we need to renormalize? If one weight goes to 0, the corresponding key should never be returned by random.choices IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we have total_length=10 and weights=[3.33, 3.33, 3.33].
We draw 4 '0's in a row, the remaining weights will be weights=[-0.67, 3.33, 3.33].
We can set the negative value to 0 and get weights=[0, 3.33, 3.33]. Now the remaining number of draws is 6 but the weights sum up to 6.66.

It should not impact the result/output, but if we choose to normalize, it will bring the sum back to 6 to match the remaining number of draws. I can take it out since it is technically unnecessary computation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will take it out, because it will greatly simplify the other functions. For example, normalize_weights will no longer have to consider the list case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, how about we always convert weights to a list of integer. For example, we do floor over all the elements except the last one. And, the rest of elements send to the last key. Then, we don't need to normalize and the weight is kept as integer then no negative value will happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. I was wrong about not needing to normalize after a weight hitting negative. Actually it is necessary.
    Example: Splitting 10 elements to 9 DPs: [1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11, 1.11]
    Without normalization, you may end up with 2 or more DPs with more than one element, while the ideal case is 8
    DPs with 1 element and 1 DP with 2.
  2. I considered your suggestion to floor all but the last one. It changes the weights from its true distributino.
    Example: 10 elements split into two DPs with weights [0.89, 0.11], most of the time you want 9 element in the first
    and 1 element second DP.

Because of these complications, I am going to add a note, telling users that if they want certainty, please provide integer weights that sum up to total_length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the current implementation is similar to doing ceiling to all keys until the last requested. That's why we have to do re-normalization when each time one key is depleted. It's fine but I feel like we can do it in advance without doing normalization multiple times.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 65a2318
Pull Request resolved: #724
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 3d02cf5
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 7e7622b
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

Thanks for the helpful comments. It is simpler than before now!

@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except the last comment, overall lgtm. Thank you.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 25, 2022
ghstack-source-id: 5fc7d47
Pull Request resolved: #724

def get_length(self, target: T) -> int:
if not self._lengths:
if all(w.is_integer() for w in self.norm_weights) and sum(self.norm_weights) == self.total_length:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can potentially assert this during __init__, such that if the normalized weights aren't integer or cannot sum up to total_length, we will raise exception.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downside are 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing seed will be confusing and causes inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 29, 2022
ghstack-source-id: 9e77b6d
Pull Request resolved: #724
@NivekT
Copy link
Contributor Author

NivekT commented Aug 29, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot deleted the gh/NivekT/86/head branch September 2, 2022 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: new feature topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants