DataLoader: properly diagnose exceeding file descriptor limit #34768
Baranowski wants to merge 12 commits into pytorch:master from …
Conversation
💊 CircleCI build failures summary and remediations
As of commit fa096bc (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following build failures do not appear to be due to upstream breakages (reran 1 job to discount flakiness):
TL;DR:
I managed to reproduce the same behaviour outside PyTorch with the following script:

```python
#!/usr/bin/env python3
import array
import os
import shutil
import socket
import sys

if len(sys.argv) != 4:
    print("Usage: ", sys.argv[0], " tmp_dirname iterations (send|recv)")
    sys.exit(1)

if __name__ == '__main__':
    dirname = sys.argv[1]
    sock_path = dirname + "/sock"
    iterations = int(sys.argv[2])

    def dummy_path(i):
        return dirname + "/" + str(i) + ".dummy"

    if sys.argv[3] == 'send':
        # Sender: open dummy files and pass their FDs over the socket.
        while not os.path.exists(sock_path):
            pass
        client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        client.connect(sock_path)
        for i in range(iterations):
            fd = os.open(dummy_path(i), os.O_WRONLY | os.O_CREAT)
            ancdata = array.array('i', [fd])
            msg = bytes([i % 256])
            print("Sending fd ", fd, " (iteration #", i, ")")
            client.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, ancdata)])
    else:
        # Receiver: accumulate FDs without closing them until the limit is hit.
        assert sys.argv[3] == 'recv'
        if os.path.exists(dirname):
            raise Exception("Directory exists")
        os.mkdir(dirname)
        print("Opening socket...")
        server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        server.bind(sock_path)
        print("Listening...")
        for i in range(iterations):
            a = array.array('i')
            msg, ancdata, flags, addr = server.recvmsg(1, socket.CMSG_SPACE(a.itemsize))
            assert len(ancdata) == 1
            cmsg_level, cmsg_type, cmsg_data = ancdata[0]
            a.frombytes(cmsg_data)
            print("Received fd ", a[0], " (iteration #", i, ")")
        shutil.rmtree(dirname)
```

Steps to reproduce (reconstructed from the script's usage string; save it as, say, `repro.py`): set a low limit with `ulimit -n`, start the receiver with `python3 repro.py /tmp/fdtest 2000 recv`, then run `python3 repro.py /tmp/fdtest 2000 send` in a second shell, keeping the iteration count above the limit.
The key problem here is that when the root process is about to receive an FD over a socket from a worker, that FD would exceed its open file descriptors limit. To mitigate that, the kernel silently strips the FD from the control message, so `recvmsg` returns no ancillary data (only the `MSG_CTRUNC` flag in the returned message flags records that something was dropped). I haven't verified it, but I suspect that this behaviour of `recvmsg` is platform-specific.

The worker itself does not exceed the limit in the common scenario. If you tweak the ulimit and the number of FDs received/sent in the script above, instead of the `received 0 items of ancdata` error on the receiving side you can get an `OSError: [Errno 24] Too many open files` on the sending side.

I would love to add a test for this change, but I'm worried that if I lower the ulimit for a single test, it might affect subsequent tests if my test exits in an unexpected way. An alternative approach would be to send as many tensors as the ulimit allows, but since that environment setting can have a very wide range, it might cause some unexpected issues.
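To make that failure mode concrete, here is a minimal sketch (not the PR's actual patch) of how a receiver on a datagram AF_UNIX socket, like the one in the script above, can detect that the kernel stripped the FD: the `MSG_CTRUNC` bit in the returned flags is the only remaining evidence.

```python
import array
import socket

def recv_one_fd(sock: socket.socket) -> int:
    """Receive one SCM_RIGHTS FD, diagnosing the stripped-FD case."""
    a = array.array('i')
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(a.itemsize))
    if flags & socket.MSG_CTRUNC or not ancdata:
        # The kernel dropped the control message, most likely because
        # installing the FD would have exceeded this process's `ulimit -n`.
        raise RuntimeError("FD was stripped from the ancillary data; "
                           "check the open file descriptor limit")
    cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    a.frombytes(cmsg_data[:a.itemsize])
    return a[0]
```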
@ssnl do you think you would be able to review this PR? (Baranowski, bug me again in a day if there's no response)
Re testing, you should just run these tests in a completely separate process from the rest of the tests. There are some examples of how to do this, see for example |
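For illustration, a minimal sketch of that separate-process pattern, with hypothetical names standing in for the actual test-suite helpers: lower `RLIMIT_NOFILE` only inside a child interpreter, so the reduced limit cannot leak into later tests even if the child dies.

```python
import subprocess
import sys

# Body of the child process; a hypothetical sketch, not the actual test code.
CHILD_SCRIPT = r"""
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))
# ... build a DataLoader here and assert it raises the new diagnostic error ...
"""

def test_in_separate_process():
    # The parent's FD limit is untouched regardless of how the child exits.
    result = subprocess.run([sys.executable, "-c", CHILD_SCRIPT])
    assert result.returncode == 0
```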
This probably shouldn't live as a top-level .md in the repo. Maybe put it as a comment near the relevant code.
Sorry, I added this file to the PR by mistake. It was supposed to stay only in my local working branch.
Code looks generally reasonable, but I'd suggest spending a little time trying to check in the very nice test you've written. |
The reason you need more than 1 could be that you need multiple FDs to receive that message from the socket. I imagine that it'd be impossible to get the correct number of FDs needed. Maybe 10 is okay.
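The "10" under discussion is the number of probe files the patch opens after catching `received 0 items of ancdata` (visible at dataloader.py:787 in the traceback quoted below). Roughly, the heuristic looks like this sketch:

```python
import tempfile

def fd_limit_seems_exhausted(probe_count: int = 10) -> bool:
    """Try opening a few files; EMFILE implicates the FD limit."""
    try:
        probes = [tempfile.NamedTemporaryFile() for _ in range(probe_count)]
    except OSError as e:
        if e.errno == 24:  # EMFILE: too many open files
            return True
        raise
    for f in probes:
        f.close()
    return False
```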
ssnl left a comment:
Nice analysis!
I think the note you wrote deserves a place in the code base. You could add it closer to the actual relevant code, or just as inline comments. You can grep for NOTE in the codebase for examples. Then you can cross-reference it in the dataloader.py:779 comment block.
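For reference, the NOTE convention being suggested looks roughly like this (the title below is illustrative, not necessarily the one the PR ended up with):

```python
# NOTE [ DataLoader and the open files limit ]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# On Linux, receiving an SCM_RIGHTS FD that would push the process over
# `ulimit -n` silently strips the FD from the message (MSG_CTRUNC is set).
# ... full analysis and repro script go here ...

# And at the use site, e.g. the dataloader.py:779 comment block:
# See NOTE [ DataLoader and the open files limit ]
```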
On test:
You might find it hard to convert your script into a proper pytorch test, especially when it does `ulimit -n`. It would be okay, but please keep the nice script in the note. :)
Force-pushed 5912cbd to a7397d2.
macOS failure seems real.
@pytorchbot merge this please |
Thanks. Sorry for the unnecessary back-and-forth at the end. |
No, thank you for the analysis and fix! |
facebook-github-bot left a comment:
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Internal CI failure. Maybe it's just something easy?
I guess AF_UNIX sockets were not used in this test for some reason. Now that I think of it, this test should not enforce that DataLoader fails. Rather, it should enforce that DataLoader either works correctly or fails in the known way. I will fix that but I'm not sure I will manage to do it today. |
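A sketch of that "either works correctly or fails in the known way" assertion, with a hypothetical helper standing in for the actual test body:

```python
def test_dataloader_under_low_fd_limit():
    try:
        iterate_dataloader_with_low_ulimit()  # hypothetical helper
    except RuntimeError as e:
        # Failing is acceptable, but only with the new diagnostic message.
        assert "Too many open files" in str(e)
```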
Thanks, no hurry. |
Force-pushed 1002fbc to a2751ee.
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Sorry about the delay. What I think the internal problem is: we're invoking a Python subprocess and it's not picking up the same Python environment as the parent process. I'm pretty sure we've done this correctly elsewhere and I need to figure out how; haven't had a chance to yet.
I just ran into this somewhere else. A couple of things that may help:
We're using |
As best as I can tell, invoking python from subprocess has never actually worked in our internal build environment (I thought it might work, because there were many subprocess invocations that weren't guarded, but as best I can tell they all accidentally weren't run at all in fbcode for various other reasons). So let's just skip this test when we run it internally. You can case on this using |
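The truncated sentence presumably names a build-environment constant; a hedged sketch of the skip, with a hypothetical flag standing in for whatever that constant is:

```python
import unittest

IS_INTERNAL_BUILD = False  # hypothetical stand-in for the real constant

class TestDataLoaderFdLimit(unittest.TestCase):
    @unittest.skipIf(IS_INTERNAL_BUILD,
                     "python subprocess invocations don't work internally")
    def test_fd_limit_diagnosis(self):
        ...
```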
Force-pushed 75e2084 to 3b12e57.
This reverts commit 45d7f101ddce82aea3bddd4e06702a60fdc7e48c.
Force-pushed 3b12e57 to fa096bc.
@ezyang Done, this should be ready to merge. The CI failure looks unrelated to me. |
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
DataLoader: properly diagnose exceeding file descriptor limit (pytorch#34768)

Summary:
Fixes pytorch#973

Common failure scenario:
* DataLoader creates workers and communicates with them through SHMs
* Workers send file descriptors to SHMs containing data back through an AF_UNIX socket
* The limit of open files gets fully used
* An FD gets stripped from a socket message coming back from a worker, without the worker knowing this
* This causes a `RuntimeError: received 0 items of ancdata` in the standard `multiprocessing` package
* The exception is not handled by PyTorch and so is presented to the users

After this change the user will see:

```
Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/wbaranowski/git/Quansight/pytorch/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 184, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 162, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 551, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 262, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpnx_f6v_f'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_shm_leak.py", line 56, in <module>
    worker_init_fn=worker_init_fn
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 861, in _next_data
    idx, data = self._get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 828, in _get_data
    success, data = self._try_get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 791, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
```

Pull Request resolved: pytorch#34768
Differential Revision: D20538053
Pulled By: ezyang
fbshipit-source-id: be4425cf2fa02aff61619b2b829c153cb1a867cb