[WIP][DataPipe] Add RandomSplitter (with buffer) #723

Closed
NivekT wants to merge 1 commit into gh/NivekT/85/base from gh/NivekT/85/head

NivekT commented Aug 9, 2022

Stack from ghstack:

This PR adds RandomSplitter with an implementation that buffers items through demux, which allows all child DataPipes to be used simultaneously. Because items destined for other children are held in memory, this may not work for memory-bound cases.
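
To make the trade-off concrete, here is a minimal pure-Python sketch of the buffered idea. It is not the code in this PR and does not use torchdata; `BufferedRandomSplit`, `child`, and `buffered` are hypothetical names chosen for illustration. Each item is assigned to one split by a weighted random draw, and items that belong to a different child are parked in that child's buffer.

```python
import random
from collections import deque


class BufferedRandomSplit:
    """Hypothetical helper (not the PR's code): routes each item to one split."""

    def __init__(self, source, weights, seed=0):
        self._it = iter(source)
        self._names = list(weights)
        self._weights = [weights[n] for n in self._names]
        self._rng = random.Random(seed)
        # One FIFO buffer per split; this is where memory can grow.
        self._buffers = {n: deque() for n in self._names}

    def buffered(self):
        """Total number of items currently waiting in buffers."""
        return sum(len(b) for b in self._buffers.values())

    def child(self, name):
        """Yield the items assigned to `name`, buffering items for other splits."""
        buf = self._buffers[name]
        while True:
            if buf:
                yield buf.popleft()
                continue
            try:
                item = next(self._it)
            except StopIteration:
                return
            target = self._rng.choices(self._names, weights=self._weights)[0]
            if target == name:
                yield item
            else:
                self._buffers[target].append(item)


split = BufferedRandomSplit(range(10), {"train": 0.8, "valid": 0.2}, seed=0)
train, valid = split.child("train"), split.child("valid")
train_items = list(train)   # draining train first parks every valid item in memory
pending = split.buffered()  # number of items held for the other child
valid_items = list(valid)
```

In this sketch the children can also be interleaved, but if one child is drained before the others, everything assigned to the others sits in memory, which is the memory-bound concern mentioned above.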

TODO:

  • Decide if we like a buffer-less version better. Or we can add both.
  • Determine if the API related to randomness needs further extension (we might need to add set_seed)
  • More tests.

See #712 for related discussion.
See #724 for the version WITHOUT buffer.

NivekT added a commit that referenced this pull request Aug 9, 2022
ghstack-source-id: 783d4a7
Pull Request resolved: #723
facebook-github-bot added the CLA Signed label Aug 9, 2022
NivekT changed the title from "[DataPipe] Add RandomSplitter (with buffer)" to "[WIP][DataPipe] Add RandomSplitter (with buffer)" Aug 9, 2022
NivekT marked this pull request as draft August 9, 2022 22:24
NivekT requested review from VitalyFedyunin and ejguan August 9, 2022 22:34
NivekT commented Aug 9, 2022

@ejguan @VitalyFedyunin This is WIP and we should discuss it, but let me know if you have any initial reaction on how we would like to do random_split.

NivekT added a commit that referenced this pull request Aug 10, 2022
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).
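
For comparison with the buffered approach, a buffer-less split can be sketched in plain Python as follows. This is an illustration only, not the implementation from #724; `random_split_view` and `make_source` are hypothetical names, and the sketch assumes the source can be re-created and yields items in the same order on every pass.

```python
import random


def random_split_view(make_source, weights, target, seed=0):
    """Hypothetical sketch: yield only the items assigned to `target`.

    Every pass replays the same seeded draws, so each item gets the same
    assignment no matter which split is being read. Nothing is buffered;
    items belonging to other splits are read and skipped.
    """
    names = list(weights)
    probs = [weights[n] for n in names]
    rng = random.Random(seed)
    for item in make_source():
        if rng.choices(names, weights=probs)[0] == target:
            yield item


weights = {"train": 0.7, "valid": 0.3}
source = lambda: range(20)  # must be re-iterable with a stable order
train_items = list(random_split_view(source, weights, "train"))
valid_items = list(random_split_view(source, weights, "valid"))
```

Each split requires its own full pass over the source, which corresponds to both downsides listed above: one group at a time, and wasted reads for the items that are skipped.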

TODO:
* Decide if we like this or the buffer version better. Or we can add both.
* Determine if the API related to randomness needs further extension (we might need to add set_seed)
* More tests.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 10, 2022
This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

TODO:
* Decide if we like this or the buffer version better. Or we can add both.
* Determine if the API related to randomness needs further extension (we might need to add set_seed)
* More tests.

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 12, 2022
…fer)"


This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change the seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing the seed would be confusing and cause inconsistency.
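
The seed concern in the last bullet can be illustrated with a small stand-alone sketch. It uses only Python's `random` module rather than the actual DataPipe; `assign` and the split weights are hypothetical choices made for the example.

```python
import random


def assign(n_items, seed, names=("train", "valid", "test"), weights=(0.6, 0.2, 0.2)):
    """Hypothetical helper: the split label each item would get on one pass."""
    rng = random.Random(seed)
    return [rng.choices(names, weights=weights)[0] for _ in range(n_items)]


same_a = assign(1000, seed=0)   # e.g. the pass that reads `test`
same_b = assign(1000, seed=0)   # a later pass that reads `valid`, same seed
changed = assign(1000, seed=1)  # a later pass after the seed was changed

consistent = same_a == same_b
# Items labelled `test` on the first pass but `valid` on the re-seeded pass
# would show up in both splits.
leaked = sum(1 for a, b in zip(same_a, changed) if a == "test" and b == "valid")
```

With a fixed seed the two passes agree on every label, so `test` and `valid` stay disjoint. With a different seed on the second pass, some items counted in `leaked` would be emitted by both splits.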

See #712 for related discussion.
See #723 for the version with buffer.

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT commented Aug 12, 2022

Closing this PR for now since we are moving forward with the buffer-less version. We can re-visit this if there is a need for it.

NivekT closed this Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 12, 2022
…fer)"


This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change the seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing the seed would be confusing and cause inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Differential Revision: [D38675266](https://our.internmc.facebook.com/intern/diff/D38675266)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 12, 2022
NivekT added a commit that referenced this pull request Aug 15, 2022
NivekT added a commit that referenced this pull request Aug 15, 2022
NivekT added a commit that referenced this pull request Aug 15, 2022
NivekT added a commit that referenced this pull request Aug 15, 2022
NivekT added a commit that referenced this pull request Aug 16, 2022
NivekT added a commit that referenced this pull request Aug 16, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 17, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 25, 2022
NivekT added a commit that referenced this pull request Aug 29, 2022
NivekT added a commit that referenced this pull request Aug 29, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 29, 2022
Summary:
Pull Request resolved: #724

This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases), but the downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).

Implementation note:
* I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
* I also decided against having an option to change the seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing the seed would be confusing and cause inconsistency.

See #712 for related discussion.
See #723 for the version with buffer.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D38675266

Pulled By: NivekT

fbshipit-source-id: 137ea860367aab9b02fd1645bf7cc0429ab1f018
facebook-github-bot deleted the gh/NivekT/85/head branch September 12, 2022 14:20