DataLoader: properly diagnose exceeding file descriptor limit#34768

Closed
Baranowski wants to merge 12 commits into pytorch:master from Quansight:wbaranowski-fd_leak-973

Conversation

@Baranowski
Contributor

Fixes #973

Common failure scenario:

  • DataLoader creates workers and communicates with them through SHMs
  • Workers send file descriptors for the SHMs containing data back over an AF_UNIX socket
  • The open-file limit gets fully used up
  • An FD gets stripped from a socket message coming back from a worker, without the worker knowing this
  • This causes a RuntimeError: received 0 items of ancdata in the standard multiprocessing package
  • The exception is not handled by PyTorch and so is presented to the user

After this change, the user will see:

Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/wbaranowski/git/Quansight/pytorch/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 184, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 162, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 551, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 262, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpnx_f6v_f'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_shm_leak.py", line 56, in <module>
    worker_init_fn=worker_init_fn
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 861, in _next_data
    idx, data = self._get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 828, in _get_data
    success, data = self._try_get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 791, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
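
In outline, the new code path catches the low-level RuntimeError and probes whether fresh file descriptors can still be opened; if not, it raises the friendlier message. A minimal sketch reconstructed from the traceback above (names and structure are illustrative, not the actual dataloader.py internals):

```
import errno
import tempfile

def _can_open_fds(n=10):
    # Probe the FD limit by opening a handful of temp files. More than one
    # is opened because receiving a message may itself need several FDs.
    try:
        fs = [tempfile.NamedTemporaryFile() for _ in range(n)]
    except OSError as e:
        if e.errno == errno.EMFILE:  # [Errno 24] Too many open files
            return False
        raise
    for f in fs:
        f.close()
    return True

def try_get_data(data_queue, timeout):
    try:
        return data_queue.get(timeout=timeout)
    except RuntimeError:
        if not _can_open_fds():
            raise RuntimeError(
                "Too many open files. Communication with the workers is no "
                "longer possible. Please increase the limit using `ulimit -n` "
                "in the shell or change the sharing strategy by calling "
                "`torch.multiprocessing.set_sharing_strategy('file_system')` "
                "at the beginning of your code")
        raise
```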

@Baranowski Baranowski requested a review from apaszke as a code owner March 14, 2020 19:17
@dr-ci

dr-ci Bot commented Mar 14, 2020

💊 CircleCI build failures summary and remediations

As of commit fa096bc (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 1 job to discount flakiness):

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Test" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Apr 10 08:03:39 AssertionError: 12 not less than or equal to 1e-05 :
Apr 10 08:03:39 ---------------------------------------------------------------------- 
Apr 10 08:03:39 Traceback (most recent call last): 
Apr 10 08:03:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Apr 10 08:03:39     self._join_processes(fn) 
Apr 10 08:03:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 285, in _join_processes 
Apr 10 08:03:39     self._check_return_codes(elapsed_time) 
Apr 10 08:03:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 324, in _check_return_codes 
Apr 10 08:03:39     self.assertEqual(p.exitcode, first_process.exitcode) 
Apr 10 08:03:39   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 927, in assertEqual 
Apr 10 08:03:39     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Apr 10 08:03:39 AssertionError: 12 not less than or equal to 1e-05 :  
Apr 10 08:03:39  
Apr 10 08:03:39 ---------------------------------------------------------------------- 
Apr 10 08:03:39 Ran 62 tests in 88.523s 
Apr 10 08:03:39  
Apr 10 08:03:39 FAILED (failures=1, skipped=2) 
Apr 10 08:03:39  
Apr 10 08:03:39 Generating XML reports... 
Apr 10 08:03:39 Generated XML report: test-reports/dist-gloo/TEST-DistAutogradTestWithSpawn-20200410080210.xml 
Apr 10 08:03:39 Traceback (most recent call last): 
Apr 10 08:03:39   File "test/run_test.py", line 696, in <module> 


@Baranowski
Contributor Author

Baranowski commented Mar 14, 2020

TL;DR:

  • The key issue most likely lies in the Python multiprocessing or socket packages, or even in the C implementation of AF_UNIX sockets. IMO it's best to work around it in PyTorch.
  • Adding a test would be challenging: the amount of data used in the test would fully depend on the host's limits, and lowering ulimit instead might mess it up for subsequent tests if the exit is not clean.

I managed to reproduce the same behaviour outside PyTorch with the following script:

#!/usr/bin/env python3
import array
import os
import shutil
import socket
import sys


if len(sys.argv) != 4:
    print("Usage: ", sys.argv[0], " tmp_dirname iterations (send|recv)")
    sys.exit(1)

if __name__ == '__main__':
    dirname = sys.argv[1]
    sock_path = dirname + "/sock"
    iterations = int(sys.argv[2])

    def dummy_path(i):
        return dirname + "/" + str(i) + ".dummy"

    if sys.argv[3] == 'send':
        # Busy-wait until the receiver has created the socket.
        while not os.path.exists(sock_path):
            pass
        client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        client.connect(sock_path)
        for i in range(iterations):
            fd = os.open(dummy_path(i), os.O_WRONLY | os.O_CREAT)
            # Attach the FD as SCM_RIGHTS ancillary data, the same mechanism
            # the multiprocessing package uses to share FDs between processes.
            ancdata = array.array('i', [fd])
            msg = bytes([i % 256])
            print("Sending fd ", fd, " (iteration #", i, ")")
            client.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, ancdata)])

    else:
        assert sys.argv[3] == 'recv'

        if os.path.exists(dirname):
            raise Exception("Directory exists")

        os.mkdir(dirname)

        print("Opening socket...")
        server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        server.bind(sock_path)

        print("Listening...")
        for i in range(iterations):
            a = array.array('i')
            # Once the receiver's FD limit is exhausted, the kernel strips the
            # FD from the ancillary data and sets MSG_CTRUNC in flags.
            msg, ancdata, flags, addr = server.recvmsg(1, socket.CMSG_SPACE(a.itemsize))
            assert len(ancdata) == 1
            cmsg_level, cmsg_type, cmsg_data = ancdata[0]
            a.frombytes(cmsg_data)
            print("Received fd ", a[0], " (iteration #", i, ")")

        shutil.rmtree(dirname)

Steps to reproduce:

  1. Run two shells and set a lower file descriptor limit in the receiving one:
     (shell1) ulimit -n 1020
     (shell2) ulimit -n 1022
  2. Run the script above with the recv option in the first shell:
     ./test_socket.py sock_tmp 1017 recv
  3. Run the script with the send option in the second shell:
     ./test_socket.py sock_tmp 1017 send

The key problem here is that when the root process is about to receive an FD over a socket from a worker, accepting that FD would exceed its open file descriptor limit. When that happens, the FD is silently removed from the message and the MSG_CTRUNC flag is set. (The flag might be a bit misleading, since the manpage says it indicates that some control data were discarded due to lack of space in the buffer for ancillary data. It doesn't really matter, though, since the standard multiprocessing package ignores it.)
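
For illustration, a receiver that wanted to notice the truncation itself could check the flags returned by recvmsg. A minimal sketch, assuming sock is an AF_UNIX datagram socket like the one in the script above:

```
import array
import socket

def recv_fd_or_fail(sock):
    # Receive one message carrying a single FD as SCM_RIGHTS ancillary data.
    a = array.array('i')
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(a.itemsize))
    if flags & socket.MSG_CTRUNC:
        # The kernel dropped the FD, e.g. because attaching it would have
        # exceeded this process's open-file limit.
        raise RuntimeError("ancillary data truncated; FD was silently dropped")
    cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    a.frombytes(cmsg_data)
    return a[0]
```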

I haven't verified it, but I suspect that this behaviour of recvmsg in Python's socket module is consistent with recvmsg in C. I'm not sure it's the desired behaviour: AFAICT, there is no indication of a problem anywhere other than the MSG_CTRUNC flag. Anyway, digging into the C implementation of recvmsg would be fun but is outside the scope of the PyTorch issue.

The worker itself does not exceed the limit in the common scenario. If you tweak the ulimit and the number of FDs received/sent in the script above, instead of RuntimeError: received 0 items of ancdata you might get OSError: Too many open files on the receiver, the sender, or both. During my testing of PyTorch I saw that OSError pass through the DataLoader a couple of times, but I couldn't reproduce it afterwards.

I would love to add a test for this change, but I'm worried that if I lower ulimit for a single test, it might affect subsequent tests if my test exits in an unexpected way. An alternative approach would be to send as many tensors as ulimit allows, but since that environment setting can have a very wide range, it might cause unexpected issues.

@Baranowski Baranowski requested a review from ezyang March 14, 2020 19:40
@ezyang ezyang requested a review from ssnl March 16, 2020 19:52
@ezyang
Contributor

ezyang commented Mar 16, 2020

@ssnl do you think you would be able to review this PR? (Baranowski bug me again in a day if there's no response)

@ezyang
Contributor

ezyang commented Mar 16, 2020

Re testing, you should just run these tests in a completely separate process from the rest of the tests. There are some examples of how to do this; see for example TestCuda.test_cuda_init_race (in general, grep for subprocess in the test suite).
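
For instance, the ulimit manipulation can be confined to a throwaway child interpreter, so a crash there cannot leak the lowered limit into the rest of the suite. A rough sketch of that pattern (the test name and child body are placeholders):

```
import subprocess
import sys
import unittest

class TestDataLoaderFDLimit(unittest.TestCase):
    def test_fd_limit_exceeded(self):
        # Lower RLIMIT_NOFILE only inside a throwaway child interpreter, so
        # a crash there cannot leak the reduced limit into later tests.
        child_script = """
import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (100, 100))
# ... exercise DataLoader here and assert on the error message ...
"""
        subprocess.check_call([sys.executable, '-c', child_script])
```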

Comment thread fd_leak_comment.md Outdated
Contributor


??

Comment thread fd_leak_comment.md Outdated
Contributor


This probably shouldn't live as a top level md in the repo. Maybe put it as a comment near the relevant code.

Contributor Author


Sorry, I added this file to the PR by mistake. It was supposed to stay only in my local working branch.

@ezyang
Contributor

ezyang commented Mar 16, 2020

Code looks generally reasonable, but I'd suggest spending a little time trying to check in the very nice test you've written.

Comment thread torch/utils/data/dataloader.py Outdated
Collaborator


The reason you need more than 1 could be that you need multiple FDs to receive that message from the socket. I imagine it'd be impossible to get the exact number of FDs needed. Maybe 10 is okay.

Collaborator

@ssnl ssnl left a comment


Nice analysis!

I think the note you wrote deserves a place in the code base. You could add it closer to the actual relevant code, or just as inline comments. You can grep for NOTE in the codebase for examples. Then you can cross-reference it in the dataloader.py:779 comment block.

On the test: you might find it hard to convert your script into a proper PyTorch test, especially since it does ulimit -n. That would be okay, but please keep the nice script in the note. :)

@Baranowski Baranowski force-pushed the wbaranowski-fd_leak-973 branch 2 times, most recently from 5912cbd to a7397d2 on March 18, 2020 15:14
Comment thread test/test_dataloader.py Outdated
@ssnl
Collaborator

ssnl commented Mar 18, 2020

macOS failure seems real.

@ssnl
Collaborator

ssnl commented Mar 18, 2020

NameError: name 'IS_MACOS' is not defined

Collaborator

@ssnl ssnl left a comment


Thanks!

@ssnl
Collaborator

ssnl commented Mar 18, 2020

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please label (Was marked for merge with @pytorchbot merge this please) Mar 18, 2020
@Baranowski
Contributor Author

Thanks. Sorry for the unnecessary back-and-forth at the end.

@ssnl
Collaborator

ssnl commented Mar 19, 2020

No, thank you for the analysis and fix!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Mar 20, 2020

Internal CI failure

  warnings.warn(err_msg)
test_fd_limit_exceeded (test_dataloader.TestDataLoader) ... I0319 21:23:47.775116 749361 init.cc:561] Skip logging in unit test environment for event: tensor.create
Traceback (most recent call last):
  File "<string>", line 3, in <module>
ModuleNotFoundError: No module named 'torch.utils'
ERROR

======================================================================
ERROR: test_fd_limit_exceeded (test_dataloader.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dev/gen/caffe2/test/dataloader#binary,link-tree/test_dataloader.py", line 816, in test_fd_limit_exceeded
    """])
  File "/usr/local/fbcode/platform007/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/fbcode/platform007/bin/python3.7', '-c', 'import torch\nimport resource\nfrom torch.utils.data import DataLoader, IterableDataset\n\nclass RandomDataset(IterableDataset):\n    def __init__(self, len, size):\n        super(RandomDataset).__init__()\n        self.len = len\n        self.size = size\n\n    def __iter__(self):\n        return self\n\n    def __next__(self):\n        if self.len <= 0:\n            raise StopIteration\n        self.len -= 1\n        return torch.randn(self.size)\n\ntry:\n    keep_fds_alive = []\n    resource.setrlimit(resource.RLIMIT_NOFILE, (100, 100))\n    for random_t in DataLoader(RandomDataset(200, (2,2)),\n                               num_workers=1):\n      random_t.max(dim=0)\n      keep_fds_alive.append(random_t)\nexcept RuntimeError as e:\n    assert "ulimit -n" in str(e)\n    assert "set_sharing_strategy" in str(e)\nelse:\n    assert False, "Expected a RuntimeError with keywords ulimit and set_sharing_strategy"\n']' returned non-zero exit status 1.

----------------------------------------------------------------------
R

Maybe it's just something easy?
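
Unescaped, the inline script embedded in that failing command reads:

```
import torch
import resource
from torch.utils.data import DataLoader, IterableDataset

class RandomDataset(IterableDataset):
    def __init__(self, len, size):
        super(RandomDataset).__init__()
        self.len = len
        self.size = size

    def __iter__(self):
        return self

    def __next__(self):
        if self.len <= 0:
            raise StopIteration
        self.len -= 1
        return torch.randn(self.size)

try:
    keep_fds_alive = []
    resource.setrlimit(resource.RLIMIT_NOFILE, (100, 100))
    for random_t in DataLoader(RandomDataset(200, (2,2)),
                               num_workers=1):
        random_t.max(dim=0)
        keep_fds_alive.append(random_t)
except RuntimeError as e:
    assert "ulimit -n" in str(e)
    assert "set_sharing_strategy" in str(e)
else:
    assert False, "Expected a RuntimeError with keywords ulimit and set_sharing_strategy"
```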

@Baranowski
Contributor Author

I guess AF_UNIX sockets were not used in this test for some reason.

Now that I think of it, this test should not enforce that DataLoader fails. Rather, it should enforce that DataLoader either works correctly or fails in the known way. I will fix that, but I'm not sure I'll manage to do it today.

@ezyang
Contributor

ezyang commented Mar 20, 2020

Thanks, no hurry.

@Baranowski Baranowski force-pushed the wbaranowski-fd_leak-973 branch 2 times, most recently from 1002fbc to a2751ee on March 21, 2020 19:07
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Apr 8, 2020

Sorry about the delay. I think the internal problem is that we're invoking python subprocess and it's not picking up the same python environment as the parent process. I'm pretty sure we've done this correctly elsewhere and I need to figure out how; I haven't had a chance to yet.

@ezyang ezyang closed this Apr 8, 2020
@ezyang ezyang reopened this Apr 8, 2020
@rgommers
Collaborator

rgommers commented Apr 9, 2020

we're invoking python subprocess and it's not picking up the same python environment as the parent process

I just ran into this somewhere else. A couple of things that may help:

  • Using sys.executable instead of python to invoke Python in the subprocess (see the sketch below)
  • If it's a conda env, it needs activation
  • If it's a virtualenv it's easier, but PATH and PYTHONHOME can still differ in the subprocess (see e.g. this SO discussion), so setting those or running source activate may help.
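
A minimal sketch of the first point (the command run in the child is illustrative):

```
import os
import subprocess
import sys

# Run the child with the same interpreter as the parent and pass the
# environment along explicitly rather than relying on a bare "python".
subprocess.check_call(
    [sys.executable, '-c', 'import torch; print(torch.__file__)'],
    env=dict(os.environ),
)
```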

@ezyang
Contributor

ezyang commented Apr 9, 2020

We're using sys.executable, but the way Python tests run internally is speshul so I wonder if that's related.

@ssnl
Collaborator

ssnl commented Apr 9, 2020

subprocess.check_output should pass along the env vars by default. Could we add pprint(os.environ) in the script to make sure that env vars really are the problem?
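
i.e., something along these lines in the child script:

```
import os
from pprint import pprint

# Dump the child's environment so it can be diffed against the parent's.
pprint(dict(os.environ))
```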

@ezyang
Contributor

ezyang commented Apr 9, 2020

As best as I can tell, invoking python from subprocess has never actually worked in our internal build environment. (I thought it might work, because there were many subprocess invocations that weren't guarded, but as best I can tell they all accidentally weren't run at all in fbcode for various other reasons.) So let's just skip this test when we run it internally. You can case on this using IS_SANDCASTLE; just do a skipIf on your test. Sorry about the hold-up.
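
A sketch of that skip, assuming IS_SANDCASTLE is importable from torch.testing._internal.common_utils like the other platform flags:

```
import unittest
from torch.testing._internal.common_utils import IS_SANDCASTLE

class TestDataLoader(unittest.TestCase):
    @unittest.skipIf(IS_SANDCASTLE,
                     "subprocess invocations don't work in the internal build")
    def test_fd_limit_exceeded(self):
        ...  # run the lowered-ulimit repro in a subprocess, as above
```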

@Baranowski Baranowski force-pushed the wbaranowski-fd_leak-973 branch from 75e2084 to 3b12e57 on April 9, 2020 19:10
@Baranowski Baranowski force-pushed the wbaranowski-fd_leak-973 branch from 3b12e57 to fa096bc on April 10, 2020 07:02
@Baranowski
Contributor Author

@ezyang Done, this should be ready to merge. The CI failure looks unrelated to me.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 69e3ee2.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…h#34768)

Summary:
Fixes pytorch#973

Pull Request resolved: pytorch#34768

Differential Revision: D20538053

Pulled By: ezyang

fbshipit-source-id: be4425cf2fa02aff61619b2b829c153cb1a867cb

Labels

merge-this-please (Was marked for merge with @pytorchbot merge this please) · Merged · open source

Development

Successfully merging this pull request may close these issues:

file_descriptor sharing strategy may be leaking FDs, resulting in DataLoader causing RuntimeError: received 0 items of ancdata

7 participants