Skip to content

[1/4][DataPipe] Properly cleanup unclosed files within generator function#89973

Closed
ejguan wants to merge 3 commits intogh/ejguan/45/basefrom
gh/ejguan/45/head
Closed

[1/4][DataPipe] Properly cleanup unclosed files within generator function#89973
ejguan wants to merge 3 commits intogh/ejguan/45/basefrom
gh/ejguan/45/head

Conversation

@ejguan
Copy link
Contributor

@ejguan ejguan commented Nov 30, 2022

Stack from ghstack:

There is a case that file.close is never called because when generator function has never reached to the end. A simple example would be zip two datepipes with different length. The longer DataPipe would never reach the end of generator and then it will be cleaned up by gc. So, the line of file.close is not executed. (This is the reason that Vitaly has to create this hack to retrieve all remaining data to make sure generator function is fully executed)

However, this hack introduces another problem where an infinite datapipe would make zip never end as it would try to deplete the infinite iterator. See: meta-pytorch/data#865

So, in this PR, I am adding a try-finally clause to make sure the file.close is always executed during the destruction of generator object. Then, we don't need the hack within zip any more.

Differential Revision: D41699470

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 30, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89973

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit d0931c8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Would like @NivekT to also take a look~

@ejguan ejguan changed the title [2/4][DataPipe] Properly cleanup unclosed files within generator function [1/4][DataPipe] Properly cleanup unclosed files within generator function Dec 1, 2022
@ejguan ejguan added the topic: bug fixes topic category label Dec 1, 2022
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! It'd be good if we have a unit test to check that things are closed, but not the most essential.

…erator function"


There is a case that `file.close` is never called because when generator function has never reached to the end. A simple example would be `zip` two datepipes with different length. The longer DataPipe would never reach the end of generator and then it will be cleaned up by `gc`. So, the line of `file.close` is not executed. (This is the reason that Vitaly has to create this [hack](https://github.com/pytorch/pytorch/blob/4451eb24e6287dff62ff8a7ec0eda6a6998807b0/torch/utils/data/datapipes/iter/combining.py#L573-L583) to retrieve all remaining data to make sure generator function is fully executed)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: meta-pytorch/data#865

So, in this PR, I am adding a `try-finally` clause to make sure the `file.close` is always executed during the destruction of `generator` object. Then, we don't need the hack within `zip` any more.

[ghstack-poisoned]
…erator function"


There is a case that `file.close` is never called because when generator function has never reached to the end. A simple example would be `zip` two datepipes with different length. The longer DataPipe would never reach the end of generator and then it will be cleaned up by `gc`. So, the line of `file.close` is not executed. (This is the reason that Vitaly has to create this [hack](https://github.com/pytorch/pytorch/blob/4451eb24e6287dff62ff8a7ec0eda6a6998807b0/torch/utils/data/datapipes/iter/combining.py#L573-L583) to retrieve all remaining data to make sure generator function is fully executed)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: meta-pytorch/data#865

So, in this PR, I am adding a `try-finally` clause to make sure the `file.close` is always executed during the destruction of `generator` object. Then, we don't need the hack within `zip` any more.

[ghstack-poisoned]
@ejguan
Copy link
Contributor Author

ejguan commented Dec 2, 2022

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 4, 2022
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit to meta-pytorch/data that referenced this pull request Dec 5, 2022
Summary:
There is a case that `file.close` is never called because when generator function has never reached to the end. A simple example would be `zip` two datepipes with different length. The longer DataPipe would never reach the end of generator and then it will be cleaned up by `gc`. So, the line of `file.close` is not executed. (This is the reason that Vitaly has to create this [hack](https://github.com/pytorch/pytorch/blob/4451eb24e6287dff62ff8a7ec0eda6a6998807b0/torch/utils/data/datapipes/iter/combining.py#L573-L583) to retrieve all remaining data to make sure generator function is fully executed)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: #865

So, in this PR, I am adding a `try-finally` clause to make sure the `file.close` is always executed during the destruction of `generator` object. Then, we don't need the hack within `zip` any more.

- pytorch/pytorch#89973
- pytorch/vision#6997
- pytorch/pytorch#89974

Pull Request resolved: #910

Reviewed By: wenleix, NivekT

Differential Revision: D41633909

Pulled By: ejguan

fbshipit-source-id: 5613e131dc8b2962c22d2bc7e3a4bb3039440c48
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
…tion (pytorch#89973)

There is a case that `file.close` is never called because when generator function has never reached to the end. A simple example would be `zip` two datepipes with different length. The longer DataPipe would never reach the end of generator and then it will be cleaned up by `gc`. So, the line of `file.close` is not executed. (This is the reason that Vitaly has to create this [hack](https://github.com/pytorch/pytorch/blob/4451eb24e6287dff62ff8a7ec0eda6a6998807b0/torch/utils/data/datapipes/iter/combining.py#L573-L583) to retrieve all remaining data to make sure generator function is fully executed)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: meta-pytorch/data#865

So, in this PR, I am adding a `try-finally` clause to make sure the `file.close` is always executed during the destruction of `generator` object. Then, we don't need the hack within `zip` any more.

Differential Revision: [D41699470](https://our.internmc.facebook.com/intern/diff/D41699470)
Pull Request resolved: pytorch#89973
Approved by: https://github.com/NivekT
@ejguan
Copy link
Contributor Author

ejguan commented Jan 9, 2023

This is the proper fix for meta-pytorch/data#436 that we have been trying to resolve.

@facebook-github-bot facebook-github-bot deleted the gh/ejguan/45/head branch June 8, 2023 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: dataloader release notes category topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants