Fix dataloader hang when it is not completely iterated #9655
ssnl wants to merge 5 commits into pytorch:master
Conversation
facebook-github-bot
left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
This is something I was thinking of too. I don't know why, but it still very occasionally produces a hang. I totally agree it should work, but I will try to figure out why it still hangs on occasion.
@pytorchbot retest this please
Been running a script over and over today to check; it only happened twice out of many hundred runs, so who knows. I think it should be fine.
Did it hang at one of the joins? If so, which one was it?
Thanks for checking! :)
apaszke
left a comment
Generally looks good, but I'd rather get rid of the done_event unless it's necessary.
```diff
  torch.manual_seed(seed)
+ # Do not wait for putting thread to join when this worker exits. Otherwise,
+ # this worker may always be waiting to put and doesn't check index_queue
```
```diff
- if r is None:
+ # use done_event so that we can get faster exiting signal even if there
+ # are still indices in index_queue
+ if r is None or done_event.is_set():
```
```diff
  self.index_queues = [multiprocessing.Queue() for _ in range(self.num_workers)]
  self.worker_queue_idx = 0
- self.worker_result_queue = multiprocessing.SimpleQueue()
+ self.worker_result_queue = multiprocessing.Queue()
```
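The switch from `SimpleQueue` to `Queue` matters because only the full `Queue` supports a timeout on `get` and exposes `cancel_join_thread()`, both of which the shutdown path relies on. A quick check of that standard-library difference:

```python
import multiprocessing as mp
import queue

q = mp.Queue()
timed_out = False
try:
    # Queue.get supports a timeout; SimpleQueue.get does not take one.
    q.get(timeout=0.1)
except queue.Empty:
    timed_out = True

# Only Queue exposes cancel_join_thread(), which lets a process exit
# without waiting for the queue's feeder thread to flush.
has_cancel = hasattr(q, "cancel_join_thread")
simple_has_cancel = hasattr(mp.SimpleQueue(), "cancel_join_thread")
print(timed_out, has_cancel, simple_has_cancel)
```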
torch/utils/data/dataloader.py (Outdated)

```diff
- if self.pin_memory or self.timeout > 0:
+ if self.pin_memory:
      self.data_queue = queue.Queue()
```
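For context, the pin-memory path hands batches from the workers' result queue to a background thread in the main process. A hypothetical sketch of such a loop (the real one also pins CUDA memory, omitted here; names are illustrative):

```python
import threading
import queue

def _pin_memory_loop(in_queue, out_queue, done_event):
    # Drain batches from the workers' result queue, "pin" them (passed
    # through unchanged in this sketch), and re-queue them for the main
    # iterator. done_event lets the thread exit promptly on shutdown.
    while not done_event.is_set():
        try:
            batch = in_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        if batch is None:  # end-of-data sentinel
            break
        out_queue.put(batch)

in_q, out_q = queue.Queue(), queue.Queue()
done = threading.Event()
t = threading.Thread(target=_pin_memory_loop, args=(in_q, out_q, done))
t.start()
in_q.put([1, 2, 3])
in_q.put(None)
batch = out_q.get(timeout=5)
t.join(timeout=5)
print(batch)
```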
torch/utils/data/dataloader.py (Outdated)

```diff
  # removes pids no matter what
  if not self.shutdown:
      self.shutdown = True
+     self.done_event.set()
```
```diff
- time.sleep(self.sleep_sec)
+ if not self.sleeped:
+     time.sleep(self.sleep_sec)
+     self.sleeped = True
```
Commits referenced by this pull request:

- Summary: second trial of pytorch#7140. cc csarofeen. Let's see if this works. It passes everything locally. Pull Request resolved: pytorch#9655. Differential Revision: D8940177. Pulled By: SsnL. fbshipit-source-id: 8d6340fc9f7355c71e1e26b262da166402faa158
- Revert "…ch#9655" (pytorch#9804). Summary: This reverts commit 9ee5133. Pull Request resolved: pytorch#9804. Reviewed By: ezyang. Differential Revision: D8987780. Pulled By: SsnL. fbshipit-source-id: 75ad70b0b8d672d0b35235fa248b187be64b68e5
- …orch#10366. Summary: pytorch#9655. Pull Request resolved: pytorch#10366. Differential Revision: D9237393. Pulled By: SsnL. fbshipit-source-id: fabfad7f371ba33300098f6b885c0e3f26c3e14a
second trial of #7140
cc @csarofeen Let's see if this works. It passes everything locally.