
Fix memory leak when using multiple workers on Windows #5585

Merged
apaszke merged 10 commits into pytorch:master from peterjc123:leak_fix on Mar 28, 2018

Conversation

@peterjc123 (Collaborator) commented Mar 6, 2018

The memory leak is caused by a difference in how file mapping (mmap) works on Windows. A Windows FileMapping object can only be released after every related process has closed its handle to it, and there is no way to explicitly delete it beforehand (no equivalent of shm_unlink).
When multiprocessing is on, the child process creates a FileMapping and the main process then opens it. Later, the child process tries to release it, but its reference count is non-zero at that point, so it cannot be released yet. The current code never gives it another chance to close the handle once that becomes possible.
This PR targets #5590.
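
For context, here is a minimal sketch of the lifecycle described above, using the real Win32 calls; the name `kName`, the size `kSize`, and the child/parent split are illustrative assumptions, not code from this PR:

```c
#include <windows.h>

#define kSize 4096
#define kName "Local\\torch_example_filemapping"  /* illustrative name */

/* Child process: create a named, page-file-backed mapping. */
HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, /* no on-disk file */
                              NULL, PAGE_READWRITE, 0, kSize, kName);
void *view = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, kSize);

/* Parent process: open the same kernel object by name.
   This adds a second reference inside the kernel. */
HANDLE h2 = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, kName);
void *view2 = MapViewOfFile(h2, FILE_MAP_ALL_ACCESS, 0, 0, kSize);

/* The kernel object is destroyed only after EVERY process has unmapped
   its view and closed its handle. There is no up-front "mark for
   deletion" call like POSIX shm_unlink. */
UnmapViewOfFile(view);  CloseHandle(h);   /* child side  */
UnmapViewOfFile(view2); CloseHandle(h2);  /* parent side */
```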
Current progress:
The memory leak with num_workers=1 should be solved. However, further work is needed for more workers.
Error type 1 (an unrelated FileMapping handle gets killed):

Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 61, in _worker_loop
    data_queue.put((idx, samples))
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 344, in put
    self._writer.send_bytes(obj)
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
OSError: [WinError 6] The handle is invalid

Error type 2 (an unrelated event handle gets killed):

Traceback (most recent call last):
  File "test.py", line 22, in <module>
    memory_error()
  File "test.py", line 17, in memory_error
    for i in dl:
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 277, in __next__
    idx, batch = self._get_batch()
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 256, in _get_batch
    return self.data_queue.get()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't open shared event: <torch_17608_4052021606_event>, error code: <2> at D:\Projects\pytorch-scripts\pytorch\aten\src\TH\THAllocator.c:245
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x000002303FB255C0>>
Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 341, in __del__
    self._shutdown_workers()
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 322, in _shutdown_workers
    self.data_queue.get()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't open shared event: <torch_18984_3257952678_event>, error code: <2> at D:\Projects\pytorch-scripts\pytorch\aten\src\TH\THAllocator.c:245

@ssnl (Collaborator) commented Mar 7, 2018

Great! Would this also solve the CUDA OOM?

@peterjc123 (Collaborator, Author)

@ssnl I don't think it's related, since this is about shared memory and CUDA tensors are not shared on Windows.

@peterjc123 (Collaborator, Author)

This PR is almost finished. However, there are some race conditions when workers shut down in the DataLoader; the traceback is listed above as error type 2.

@peterjc123 changed the title from "[WIP] Fix memory leak when using multiple workers on Windows" to "Fix memory leak when using multiple workers on Windows" on Mar 8, 2018
@peterjc123 (Collaborator, Author)

@soumith @apaszke It should be okay now. I've tested it with a very large dataset and saw no memory leak or exception after running for 10 minutes with num_workers set to 8.

@apaszke (Contributor) left a review

I'm not sure what kind of failures this is solving. I think it will still be true that only a single process closes the file (the one that originally opened it), right?

Also, if the only process that can actually free the file is the one that created it, then we will have other problems. E.g. what if children open mappings, send them to the main process, and die before the main process sets the event? What happens then?

@peterjc123 (Collaborator, Author)

@ssnl Sorry, my earlier answer was wrong. I think they may be related after all.

@peterjc123 (Collaborator, Author) commented Mar 8, 2018

@apaszke It's my fault for not explaining it clearly. A FileMapping is a kernel object on Windows, and a kernel object is released only when all references to it are released. When we use a FileMapping, we first create it in the child process with CreateFileMapping, and then the parent process opens it with OpenFileMapping. The problem is that when the refcount drops to zero, only one side actually closes the FileMapping with CloseHandle: when the other side tried to close it earlier, the refcount was still non-zero, so its CloseHandle was never triggered. To fully release the FileMapping, CloseHandle has to be called on both sides.
You may ask: why don't we simply close the handle when the free function is called, given that the kernel object is kept alive while another reference is still open? Because once we call CloseHandle, the handle becomes invalid and we cannot use it anymore. Python's multiprocessing library uses a Pipe with overlapped I/O (similar to asynchronous I/O), so we don't know exactly when use of the handle is complete. It is therefore only safe to close both handles once the refcount of the FileMapping drops to zero, because that means the transfer is done.
As for process death, there are two cases: the child process dies first, or the parent process dies first. The first case is actually safe, because the parent process will hang (or exit with an error) and print an error message, so the user will know something is wrong and kill it. If the parent process dies first, the child process dies too: the pipe is no longer usable, so an exception is thrown and the process stops. Once both processes are killed, the Events are released as well, because an Event is also a kernel object.
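
A minimal sketch of the handshake this implies: the creator waits on a companion named Event before dropping its FileMapping handle, and the receiver signals that Event only after it has opened its own reference. The names (`kMapName`, `kEventName`) and the creator/receiver split are illustrative assumptions, not the exact code in this PR (which avoids blocking a thread like this):

```c
#include <windows.h>

#define kMapName   "Local\\torch_example_storage"  /* illustrative */
#define kEventName "Local\\torch_example_event"    /* illustrative */

/* Creator side: publish the mapping plus a companion named event,
   then wait until the peer confirms it holds its own reference. */
void creator(HANDLE hMapping, void *view) {
    HANDLE ev = CreateEventA(NULL, FALSE, FALSE, kEventName);
    /* ... send kMapName to the peer over the multiprocessing pipe ... */
    WaitForSingleObject(ev, INFINITE);  /* peer has opened the mapping */
    CloseHandle(ev);
    UnmapViewOfFile(view);
    CloseHandle(hMapping);  /* safe: kernel refcount cannot hit zero early */
}

/* Receiver side: open the mapping FIRST, then signal the creator. */
void receiver(void) {
    HANDLE hMapping = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, kMapName);
    HANDLE ev = OpenEventA(EVENT_MODIFY_STATE, FALSE, kEventName);
    SetEvent(ev);
    CloseHandle(ev);
    /* ... use the storage; close hMapping when done on this side ... */
}
```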

@yf225 mentioned this pull request on Mar 9, 2018
@peterjc123 force-pushed the leak_fix branch 2 times, most recently from 7a3646c to 7e4d852, on March 11, 2018
@peterjc123 (Collaborator, Author)

@apaszke The code was rewritten to use only one daemon thread per process. I think it should be okay now.

@peterjc123 (Collaborator, Author)

@apaszke Please remember to review this when you are free.

@apaszke (Contributor) left a review

Thanks for the explanation. If that's the case, then THRefcountedMapAllocator should basically behave very similarly to THMapAllocator (it might even be enough to make it an alias? I'm not sure what the semantics of deleting a file are while someone else has it open on Windows), because the refcounting is implemented by the Windows kernel, and the file stays alive as long as any process has it open. The only reason I introduced the refcounted allocator is that there's no way to force POSIX systems to give us the same behavior, so I had to emulate it with a manual refcount. This would be both much simpler and more efficient.

@ezyang mentioned this pull request on Mar 13, 2018
@peterjc123 (Collaborator, Author)

@apaszke Could you review this again please?

@apaszke (Contributor) commented Mar 15, 2018

While your changes may have addressed my inline comments (I haven't checked yet), I still can't see why we have to add 300 lines of multithreaded code to handle a problem that seems nonexistent on Windows. I still think this is not the right approach, and it could be greatly simplified.

@peterjc123 (Collaborator, Author)

@apaszke
I think there's still some confusion in your understanding of the issue, so I'll explain it again more clearly.

Explanation

The release procedure for a FileMapping is described in this MSDN document. That is to say, to fully release a FileMapping, every process involved should do the following:

```c
CloseHandle(handle);
UnmapViewOfFile(data);
```

However, currently we do it based on the refcount. That is:

```c
if (THAtomicDecrementRef(&info->refcount)) {
  CloseHandle(ctx->handle);  /* runs only in the one process that sees zero */
}
UnmapViewOfFile(data);
```

The refcount is a global count over all the processes that hold data, not a per-process local refcount for data. So when the global refcount drops to zero, that necessarily happens in only one process, and only that process closes its handle. Note that a FileMapping is created per tensor; it is not a process-level object. This is how the memory leak happens: the FileMapping is never released because the other processes still hold open handles to it.

About the simplified solution

Yes, you may argue that this implementation is not good enough. However, this issue really does exist: memory is actually eaten up when you use multiple workers. A more elegant way to solve it might be to delay the ctx_context_free of the shared tensor and call it once the data transfer is over, but that is really hard to implement. As an alternative, we could use a thread pool or something like that, though I don't know whether that would be enough.

@peterjc123 force-pushed the leak_fix branch 2 times, most recently from 5e506c2 to 2a88fbc, on March 17, 2018
@apaszke (Contributor) commented Mar 18, 2018

@peterjc123 thanks a lot for your detailed write-up. It states the problem very clearly.

Still, this reassures me that I understood what you said before correctly, and I stand by my hypothesis. I agree that the current behavior (closing only when the refcount reaches zero) is incorrect! My only point is that what you wrote in this patch is unnecessary and could be greatly simplified. The reason THRefcountedMapAllocator exists is exactly that the Linux kernel doesn't provide the same refcounting mechanism for shared memory objects that the Windows kernel does. That means using this allocator is pointless on Windows, and it could have been safely replaced with the regular THMapAllocator!

I still haven't seen any point against this in your argumentation. I'm not a Windows expert, so there might be caveats I'm unaware of, but if that's the case, please let me know!

@peterjc123 (Collaborator, Author)

@apaszke Thanks for your review. Yes, we could use THMapAllocator instead of THRefcountedMapAllocator to get rid of the refcount, but that is not actually the point. Something else blocks this, too: Python's multiprocessing library uses a Pipe to transfer data across processes on Windows, and during that transfer the FileMapping must not be closed. But the Pipe uses asynchronous I/O, so the correct moment to close the FileMapping is not easy to determine. (Insert code into multiprocessing?)
As an alternative, we can use THRefcountedMapAllocator to confirm that the I/O is actually finished when its refcount drops to zero. That is what this PR currently does.
Please let me know what you think.
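
To illustrate why the closing point is hard to pin down, here is a hedged sketch of an overlapped (asynchronous) pipe write in Win32; `hPipe`, `buf`, and `len` are hypothetical, and this shows the general overlapped I/O pattern, not code from Python's multiprocessing internals:

```c
#include <windows.h>

void send_async(HANDLE hPipe, const char *buf, DWORD len) {
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    /* An overlapped write returns immediately; the data (e.g. the
       FileMapping name) may not have reached the receiver yet. */
    BOOL ok = WriteFile(hPipe, buf, len, NULL, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        /* The transfer is still in flight, so the receiver may not have
           opened the FileMapping yet. Closing our mapping handle at this
           point could drop the kernel refcount to zero and destroy the
           shared memory before the receiver ever sees it. */
        DWORD written;
        GetOverlappedResult(hPipe, &ov, &written, TRUE);  /* wait for completion */
    }
    CloseHandle(ov.hEvent);
}
```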

@apaszke (Contributor) commented Mar 19, 2018

@peterjc123 OK, so the issue is that we only send the name of the shared object, and that doesn't cause its refcount to increase, so we can't close it until it has been opened in the other process. Do I understand this correctly?

Also, is file_system the only sharing strategy we support at the moment, or does file_descriptor work too?

@peterjc123 (Collaborator, Author)

@apaszke Yes, you are right. The problem is that we cannot get the handle (the ctx object may already be gone) and we don't know when to release it once the I/O completes.
I don't know exactly which strategy a FileMapping should count as. It uses memory rather than the hard disk (although backing a FileMapping with a file is also possible). file_descriptor works with fork, while FileMapping works with spawn: a file_descriptor is shared through a number (a handle), whereas FileMapping uses a common name to identify the same object.

@peterjc123 (Collaborator, Author)

@apaszke What do you think about this?

@apaszke (Contributor) left a review

Looks much better - RegisterWaitForSingleObject is super helpful in this context! The code could still be simplified a bit, but I think it's good to merge right after that's fixed.

My only concern with this method is that it keeps the tensors open in all processes for as long as they are alive in any of the processes. That might cause scripts to run out of free handles if someone has a lot of long-lived shared tensors. Still, it's a good enough fix for now, thanks a lot @peterjc123!
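
For reference, a minimal sketch of the RegisterWaitForSingleObject pattern being referred to: a thread-pool wait fires when the peer signals the companion event, and the callback drops this process's reference without tying up a dedicated thread. The `ReleaseCtx` struct and all names are illustrative assumptions, not the PR's actual code:

```c
#include <windows.h>
#include <stdlib.h>

typedef struct {
    HANDLE mapping;  /* our handle to the FileMapping         */
    HANDLE event;    /* companion event the peer signals      */
    HANDLE wait;     /* set by RegisterWaitForSingleObject    */
    void  *view;     /* mapped view of the storage            */
} ReleaseCtx;

static VOID CALLBACK on_peer_done(PVOID param, BOOLEAN timed_out) {
    ReleaseCtx *ctx = (ReleaseCtx *)param;
    (void)timed_out;  /* INFINITE timeout: we are only ever signaled */
    UnmapViewOfFile(ctx->view);
    CloseHandle(ctx->mapping);  /* drop our reference; the kernel frees the
                                   object once all processes have done so */
    CloseHandle(ctx->event);
    /* Note: the wait registration itself must be unregistered with
       UnregisterWaitEx from outside this callback to avoid deadlock. */
    free(ctx);
}

/* After sending the mapping's name over the pipe, register a one-shot,
   non-blocking wait on the companion event. */
void defer_release(ReleaseCtx *ctx) {
    RegisterWaitForSingleObject(&ctx->wait, ctx->event, on_peer_done,
                                ctx, INFINITE, WT_EXECUTEONLYONCE);
}
```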

@peterjc123 (Collaborator, Author)

@apaszke Done!

@apaszke merged commit ae4362b into pytorch:master on Mar 28, 2018
@apaszke (Contributor) commented Mar 28, 2018

Thank you @peterjc123!
