Skip to content

move GroupByFilename Dataset to DataPipe#51709

Closed
glaringlee wants to merge 4 commits intogh/glaringlee/45/basefrom
gh/glaringlee/45/head
Closed

move GroupByFilename Dataset to DataPipe#51709
glaringlee wants to merge 4 commits intogh/glaringlee/45/basefrom
gh/glaringlee/45/head

Conversation

@glaringlee
Copy link
Copy Markdown
Contributor

@glaringlee glaringlee commented Feb 4, 2021

Stack from ghstack:

Move GroupByFilename Dataset to DataPipe

Differential Revision: D26263585

@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Feb 4, 2021

💊 CI failures summary and remediations

As of commit 30451a0 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

1 failure not recognized by patterns:

Job Step Action
CircleCI pytorch_linux_xenial_py3_clang5_asan_build Build 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

glaringlee pushed a commit that referenced this pull request Feb 4, 2021
ghstack-source-id: c9d8271
Pull Request resolved: #51709
Comment on lines +71 to +72
assert self.group_size > 0
assert self.max_buffer_size >= 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move it into init(), so it will break at construction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, max_buffer_size should be larger or equal to group_size.

Comment on lines +35 to +40
class GroupByFilenameIterDataPipe(IterDataPipe):
r""" :class:`GroupByFilenameIterDataPipe`.

Iterable datapipe to group binary streams from input iterables by pathname without extension,
yield a list with `group_size` items in it, each item in the list is a tuple of
pathname and binary stream
Copy link
Copy Markdown
Contributor

@ejguan ejguan Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, the second element is not required to be binary streams, right?
We can make it more general like GroupByKeyIterDataPipe as a base class. Then we can say default behavior is using group_by_filename to group by filename.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just change this to GroupByKeyIterDataPipe, I don't think we need a base class. What will be used as a key is totally decided by the group_key_fn. I will change the comments in the code as well.

Comment on lines +51 to +58
def __init__(
self,
datapipe : Iterable[Tuple[str, Any]],
group_size : int = 1,
max_buffer_size : int = 10,
group_key_fn : Callable = default_group_key_fn,
sort_data_fn : Callable = default_sort_data_fn,
length: int = -1):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, can you make all arguments after datapipe only accepting key arguments?
Second, do you think it's better to not provide default group_size, and make default max_buffer_size as Optional and initialized based on the value of group_size, like 10 times bigger than group_size.

Comment on lines +74 to +91
if self.group_size == 1:
for data in self.datapipe:
yield [data]

for data in self.datapipe:
key = self.group_key_fn(data)
res = self.stream_buffer.get(key, []) + [data]
if len(res) == self.group_size:
yield self.sort_data_fn(res)
del self.stream_buffer[key]
self.curr_buffer_size = self.curr_buffer_size - self.group_size + 1
else:
if self.curr_buffer_size == self.max_buffer_size:
raise OverflowError(
"stream_buffer is overflow, please adjust the order of data "
"in the input datapipe or increase the buffer size!")
self.stream_buffer[key] = res
self.curr_buffer_size = self.curr_buffer_size + 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it will cause some problem. If someone implements some hack to raise StopIteration in the middle of iteration of datapipe, this generator will go from the part of group_size == 1 to the second part. So I guess it's safer to use if ... else here.

Move GroupByFilename Dataset to DataPipe

[ghstack-poisoned]
glaringlee pushed a commit that referenced this pull request Feb 4, 2021
ghstack-source-id: 64820fb
Pull Request resolved: #51709
@glaringlee glaringlee requested a review from ejguan February 4, 2021 23:57
Copy link
Copy Markdown
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one nit optimization.

else:
for data in self.datapipe:
key = self.group_key_fn(data)
res = self.stream_buffer.get(key, []) + [data]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid copy here.

if key not in self.stream_buffer:
    self.stream_buffer[key] = []
res = self.stream_buffer[key]  # share same storage
res.append(data) 

raise OverflowError(
"stream_buffer is overflow, please adjust the order of data "
"in the input datapipe or increase the buffer size!")
self.stream_buffer[key] = res
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's not needed here.

Move GroupByFilename Dataset to DataPipe

Differential Revision: [D26263585](https://our.internmc.facebook.com/intern/diff/D26263585)

[ghstack-poisoned]
Move GroupByFilename Dataset to DataPipe

Differential Revision: [D26263585](https://our.internmc.facebook.com/intern/diff/D26263585)

[ghstack-poisoned]
glaringlee pushed a commit that referenced this pull request Feb 8, 2021
ghstack-source-id: b931524
Pull Request resolved: #51709
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@glaringlee merged this pull request in 015cabf.

@facebook-github-bot facebook-github-bot deleted the gh/glaringlee/45/head branch February 12, 2021 15:18
facebook-github-bot pushed a commit that referenced this pull request Feb 24, 2021
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- #50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- #51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- #51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- #51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- #51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- #51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: #52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in #52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
walterddr pushed a commit to walterddr/pytorch that referenced this pull request Feb 25, 2021
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- pytorch#50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- pytorch#51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- pytorch#51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- pytorch#51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- pytorch#51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- pytorch#51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: pytorch#52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
malfet pushed a commit that referenced this pull request Feb 26, 2021
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- #50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- #51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- #51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- #51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- #51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- #51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: #52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in #52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804

Co-authored-by: Sam Estep <sestep@fb.com>
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- pytorch#50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- pytorch#51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- pytorch#51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- pytorch#51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- pytorch#51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- pytorch#51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: pytorch#52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Pull Request resolved: pytorch#51709

Move GroupByFilename Dataset to DataPipe

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26263585

Pulled By: glaringlee

fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- pytorch#50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- pytorch#51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- pytorch#51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- pytorch#51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- pytorch#51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- pytorch#51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: pytorch#52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#51709

Move GroupByFilename Dataset to DataPipe

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26263585

Pulled By: glaringlee

fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- pytorch#50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- pytorch#51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- pytorch#51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- pytorch#51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- pytorch#51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- pytorch#51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: pytorch#52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in pytorch#52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants