move GroupByFilename Dataset to DataPipe#51709
move GroupByFilename Dataset to DataPipe#51709glaringlee wants to merge 4 commits intogh/glaringlee/45/basefrom
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 30451a0 (more details on the Dr. CI page):
1 failure not recognized by patterns:
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
| assert self.group_size > 0 | ||
| assert self.max_buffer_size >= 0 |
There was a problem hiding this comment.
Can you move it into init(), so it will break at construction.
There was a problem hiding this comment.
BTW, max_buffer_size should be larger or equal to group_size.
| class GroupByFilenameIterDataPipe(IterDataPipe): | ||
| r""" :class:`GroupByFilenameIterDataPipe`. | ||
|
|
||
| Iterable datapipe to group binary streams from input iterables by pathname without extension, | ||
| yield a list with `group_size` items in it, each item in the list is a tuple of | ||
| pathname and binary stream |
There was a problem hiding this comment.
First, the second element is not required to be binary streams, right?
We can make it more general like GroupByKeyIterDataPipe as a base class. Then we can say default behavior is using group_by_filename to group by filename.
There was a problem hiding this comment.
Let me just change this to GroupByKeyIterDataPipe, I don't think we need a base class. What will be used as a key is totally decided by the group_key_fn. I will change the comments in the code as well.
| def __init__( | ||
| self, | ||
| datapipe : Iterable[Tuple[str, Any]], | ||
| group_size : int = 1, | ||
| max_buffer_size : int = 10, | ||
| group_key_fn : Callable = default_group_key_fn, | ||
| sort_data_fn : Callable = default_sort_data_fn, | ||
| length: int = -1): |
There was a problem hiding this comment.
First, can you make all arguments after datapipe only accepting key arguments?
Second, do you think it's better to not provide default group_size, and make default max_buffer_size as Optional and initialized based on the value of group_size, like 10 times bigger than group_size.
| if self.group_size == 1: | ||
| for data in self.datapipe: | ||
| yield [data] | ||
|
|
||
| for data in self.datapipe: | ||
| key = self.group_key_fn(data) | ||
| res = self.stream_buffer.get(key, []) + [data] | ||
| if len(res) == self.group_size: | ||
| yield self.sort_data_fn(res) | ||
| del self.stream_buffer[key] | ||
| self.curr_buffer_size = self.curr_buffer_size - self.group_size + 1 | ||
| else: | ||
| if self.curr_buffer_size == self.max_buffer_size: | ||
| raise OverflowError( | ||
| "stream_buffer is overflow, please adjust the order of data " | ||
| "in the input datapipe or increase the buffer size!") | ||
| self.stream_buffer[key] = res | ||
| self.curr_buffer_size = self.curr_buffer_size + 1 |
There was a problem hiding this comment.
I am not sure if it will cause some problem. If someone implements some hack to raise StopIteration in the middle of iteration of datapipe, this generator will go from the part of group_size == 1 to the second part. So I guess it's safer to use if ... else here.
Move GroupByFilename Dataset to DataPipe [ghstack-poisoned]
ejguan
left a comment
There was a problem hiding this comment.
LGTM with one nit optimization.
| else: | ||
| for data in self.datapipe: | ||
| key = self.group_key_fn(data) | ||
| res = self.stream_buffer.get(key, []) + [data] |
There was a problem hiding this comment.
We can avoid copy here.
if key not in self.stream_buffer:
self.stream_buffer[key] = []
res = self.stream_buffer[key] # share same storage
res.append(data) | raise OverflowError( | ||
| "stream_buffer is overflow, please adjust the order of data " | ||
| "in the input datapipe or increase the buffer size!") | ||
| self.stream_buffer[key] = res |
Move GroupByFilename Dataset to DataPipe Differential Revision: [D26263585](https://our.internmc.facebook.com/intern/diff/D26263585) [ghstack-poisoned]
Move GroupByFilename Dataset to DataPipe Differential Revision: [D26263585](https://our.internmc.facebook.com/intern/diff/D26263585) [ghstack-poisoned]
|
@glaringlee merged this pull request in 015cabf. |
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - #50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - #51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - #51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - #51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - #51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - #51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: #52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in #52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - pytorch#50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - pytorch#51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - pytorch#51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - pytorch#51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - pytorch#51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - pytorch#51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: pytorch#52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - #50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - #51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - #51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - #51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - #51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - #51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: #52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in #52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804 Co-authored-by: Sam Estep <sestep@fb.com>
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - pytorch#50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - pytorch#51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - pytorch#51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - pytorch#51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - pytorch#51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - pytorch#51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: pytorch#52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Summary: Pull Request resolved: pytorch#51709 Move GroupByFilename Dataset to DataPipe Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D26263585 Pulled By: glaringlee fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - pytorch#50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - pytorch#51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - pytorch#51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - pytorch#51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - pytorch#51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - pytorch#51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: pytorch#52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Summary: Pull Request resolved: pytorch#51709 Move GroupByFilename Dataset to DataPipe Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D26263585 Pulled By: glaringlee fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
Summary: The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future. Tagging the following people to ask what to do to fix these `EXE002` warnings: - pytorch#50629 authored by jaglinux, approved by rohan-varma - `test/distributed/test_c10d.py` - pytorch#51262 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/__init__.py` - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py` - `torch/utils/data/datapipes/iter/listdirfiles.py` - `torch/utils/data/datapipes/iter/__init__.py` - `torch/utils/data/datapipes/utils/__init__.py` - `torch/utils/data/datapipes/utils/common.py` - pytorch#51398 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromtar.py` - pytorch#51599 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/readfilesfromzip.py` - pytorch#51704 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/routeddecoder.py` - `torch/utils/data/datapipes/utils/decoder.py` - pytorch#51709 authored by glaringlee, approved by ejguan - `torch/utils/data/datapipes/iter/groupbykey.py` Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang? Pull Request resolved: pytorch#52750 Test Plan: The **Lint / flake8-py3** job in GitHub Actions: - [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly - [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings - [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix Reviewed By: walterddr, janeyx99 Differential Revision: D26637307 Pulled By: samestep fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Stack from ghstack:
Move GroupByFilename Dataset to DataPipe
Differential Revision: D26263585