
Fix memory leak when using multiple workers on Windows #5585

Merged
apaszke merged 10 commits into pytorch:master from peterjc123:leak_fix on Mar 28, 2018

Conversation

@peterjc123 (Collaborator) commented Mar 6, 2018

The memory leak is caused by a difference in how file mapping (mmap) works on Windows. A Windows FileMapping object can only be released after every related process has closed its handle to it, and there is no way to explicitly delete it beforehand (no equivalent of shm_unlink).
When multiprocessing is on, the child process creates a FileMapping and the main process then opens it. Later, the child process tries to release it, but its reference count is non-zero at that point, so it cannot be released yet. The current code never gives it another chance to close the handle once that becomes possible.
This PR targets #5590.
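
For context, here is a minimal sketch of the lifecycle described above, using the real Win32 calls; the name `kName`, the size `kSize`, and the child/parent split are illustrative assumptions, not code from this PR:

```c
#include <windows.h>

#define kSize 4096
#define kName "Local\\torch_example_filemapping"  /* illustrative name */

/* Child process: create a named, page-file-backed mapping. */
HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, /* no on-disk file */
                              NULL, PAGE_READWRITE, 0, kSize, kName);
void *view = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, kSize);

/* Parent process: open the same kernel object by name.
   This adds a second reference inside the kernel. */
HANDLE h2 = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, kName);
void *view2 = MapViewOfFile(h2, FILE_MAP_ALL_ACCESS, 0, 0, kSize);

/* The kernel object is destroyed only after EVERY process has unmapped
   its view and closed its handle. There is no up-front "mark for
   deletion" call like POSIX shm_unlink. */
UnmapViewOfFile(view);  CloseHandle(h);   /* child side  */
UnmapViewOfFile(view2); CloseHandle(h2);  /* parent side */
```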
Current progress:
The memory leak with num_workers=1 should be solved. However, further work is needed for more workers.
Error type 1 (an unrelated FileMapping handle gets killed):

Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 61, in _worker_loop
    data_queue.put((idx, samples))
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 344, in put
    self._writer.send_bytes(obj)
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
OSError: [WinError 6] The handle is invalid

Error type 2 (an unrelated event handle gets killed):

Traceback (most recent call last):
  File "test.py", line 22, in <module>
    memory_error()
  File "test.py", line 17, in memory_error
    for i in dl:
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 277, in __next__
    idx, batch = self._get_batch()
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 256, in _get_batch
    return self.data_queue.get()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't open shared event: <torch_17608_4052021606_event>, error code: <2> at D:\Projects\pytorch-scripts\pytorch\aten\src\TH\THAllocator.c:245
Exception ignored in: <bound method DataLoaderIter.__del__ of <torch.utils.data.dataloader.DataLoaderIter object at 0x000002303FB255C0>>
Traceback (most recent call last):
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 341, in __del__
    self._shutdown_workers()
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\utils\data\dataloader.py", line 322, in _shutdown_workers
    self.data_queue.get()
  File "C:\Anaconda2\envs\test_new\lib\multiprocessing\queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "C:\Anaconda2\envs\test_new\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
    storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't open shared event: <torch_18984_3257952678_event>, error code: <2> at D:\Projects\pytorch-scripts\pytorch\aten\src\TH\THAllocator.c:245

@ssnl (Collaborator) commented Mar 7, 2018

Great! Would this also solve the CUDA OOM?

@peterjc123 (Collaborator, Author)

@ssnl I don't think it's related, since this is about shared memory and CUDA tensors are not shared on Windows.

@peterjc123 (Collaborator, Author)

This PR is almost finished. However, there are some race conditions when workers shut down in the DataLoader; the traceback is listed above as error type 2.

@peterjc123 changed the title from "[WIP] Fix memory leak when using multiple workers on Windows" to "Fix memory leak when using multiple workers on Windows" on Mar 8, 2018
@peterjc123 (Collaborator, Author)

@soumith @apaszke It should be okay now. I've tested it with a very large dataset and saw no memory leak or exception after running for 10 minutes with num_workers set to 8.

@apaszke (Contributor) left a review

I'm not sure what kind of failures this is solving. I think it will still be true that only a single process closes the file (the one that originally opened it), right?

Also, if the only process that can actually free the file is the one that created it, then we will have other problems. E.g. what if children open mappings, send them to the main process, and die before the main process sets the event? What happens then?

@peterjc123 (Collaborator, Author)

@ssnl Sorry, my earlier answer was wrong. I think they may be related after all.

@peterjc123 (Collaborator, Author) commented Mar 8, 2018

@apaszke It's my fault for not explaining it clearly. A FileMapping is a kernel object on Windows, and a kernel object is released only when all references to it are released. When we use a FileMapping, we first create it in the child process with CreateFileMapping, and then the parent process opens it with OpenFileMapping. The problem is that when the refcount drops to zero, only one side actually closes the FileMapping with CloseHandle: when the other side tried to close it earlier, the refcount was still non-zero, so its CloseHandle was never triggered. To fully release the FileMapping, CloseHandle has to be called on both sides.
You may ask: why don't we simply close the handle when the free function is called, given that the kernel object is kept alive while another reference is still open? Because once we call CloseHandle, the handle becomes invalid and we cannot use it anymore. Python's multiprocessing library uses a Pipe with overlapped I/O (similar to asynchronous I/O), so we don't know exactly when use of the handle is complete. It is therefore only safe to close both handles once the refcount of the FileMapping drops to zero, because that means the transfer is done.
As for process death, there are two cases: the child process dies first, or the parent process dies first. The first case is actually safe, because the parent process will hang (or exit with an error) and print an error message, so the user will know something is wrong and kill it. If the parent process dies first, the child process dies too: the pipe is no longer usable, so an exception is thrown and the process stops. Once both processes are killed, the Events are released as well, because an Event is also a kernel object.
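
A minimal sketch of the handshake this implies: the creator waits on a companion named Event before dropping its FileMapping handle, and the receiver signals that Event only after it has opened its own reference. The names (`kMapName`, `kEventName`) and the creator/receiver split are illustrative assumptions, not the exact code in this PR (which avoids blocking a thread like this):

```c
#include <windows.h>

#define kMapName   "Local\\torch_example_storage"  /* illustrative */
#define kEventName "Local\\torch_example_event"    /* illustrative */

/* Creator side: publish the mapping plus a companion named event,
   then wait until the peer confirms it holds its own reference. */
void creator(HANDLE hMapping, void *view) {
    HANDLE ev = CreateEventA(NULL, FALSE, FALSE, kEventName);
    /* ... send kMapName to the peer over the multiprocessing pipe ... */
    WaitForSingleObject(ev, INFINITE);  /* peer has opened the mapping */
    CloseHandle(ev);
    UnmapViewOfFile(view);
    CloseHandle(hMapping);  /* safe: kernel refcount cannot hit zero early */
}

/* Receiver side: open the mapping FIRST, then signal the creator. */
void receiver(void) {
    HANDLE hMapping = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, kMapName);
    HANDLE ev = OpenEventA(EVENT_MODIFY_STATE, FALSE, kEventName);
    SetEvent(ev);
    CloseHandle(ev);
    /* ... use the storage; close hMapping when done on this side ... */
}
```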

@yf225 mentioned this pull request on Mar 9, 2018
@peterjc123 force-pushed the leak_fix branch 2 times, most recently from 7a3646c to 7e4d852, on March 11, 2018
@peterjc123 (Collaborator, Author)

@apaszke The code was rewritten to use only one daemon thread per process. I think it should be okay now.

@peterjc123 (Collaborator, Author)

@apaszke Please remember to review this when you are free.

@apaszke (Contributor) left a review

Thanks for the explanation. If that's the case, then THRefcountedMapAllocator should basically behave very similarly to THMapAllocator (it might even be enough to make it an alias? I'm not sure what the semantics of deleting a file are while someone else has it open on Windows), because the refcounting is implemented by the Windows kernel, and the file stays alive as long as any process has it open. The only reason I introduced the refcounted allocator is that there's no way to force POSIX systems to give us the same behavior, so I had to emulate it with a manual refcount. This would be both much simpler and more efficient.

@ezyang mentioned this pull request on Mar 13, 2018
@peterjc123 (Collaborator, Author)

@apaszke Could you review this again please?

@apaszke (Contributor) commented Mar 15, 2018

While your changes may have addressed my inline comments (I haven't checked yet), I still can't see why we have to add 300 lines of multithreaded code to handle a problem that seems nonexistent on Windows. I still think this is not the right approach, and it could be greatly simplified.

@peterjc123 (Collaborator, Author)

@apaszke
I think there's still some confusion in your understanding of the issue, so I'll explain it again more clearly.

Explanation

The release procedure for a FileMapping is described in this MSDN document. That is to say, to fully release a FileMapping, every process involved should do the following:

```c
CloseHandle(handle);
UnmapViewOfFile(data);
```

However, currently we do it based on the refcount. That is:

```c
if (THAtomicDecrementRef(&info->refcount)) {
  CloseHandle(ctx->handle);  /* runs only in the one process that sees zero */
}
UnmapViewOfFile(data);
```

The refcount is a global count over all the processes that hold data, not a per-process local refcount for data. So when the global refcount drops to zero, that necessarily happens in only one process, and only that process closes its handle. Note that a FileMapping is created per tensor; it is not a process-level object. This is how the memory leak happens: the FileMapping is never released because the other processes still hold open handles to it.

About the simplified solution

Yes, you may argue that this implementation is not good enough. However, this issue really does exist: memory is actually eaten up when you use multiple workers. A more elegant way to solve it might be to delay the ctx_context_free of the shared tensor and call it once the data transfer is over, but that is really hard to implement. As an alternative, we could use a thread pool or something like that, though I don't know whether that would be enough.

@peterjc123 force-pushed the leak_fix branch 2 times, most recently from 5e506c2 to 2a88fbc, on March 17, 2018
@apaszke (Contributor) commented Mar 18, 2018

@peterjc123 thanks a lot for your detailed write-up. It states the problem very clearly.

Still, this reassures me that I understood what you said before correctly, and I stand by my hypothesis. I agree that the current behavior (closing only when the refcount reaches zero) is incorrect! My only point is that what you wrote in this patch is unnecessary and could be greatly simplified. The reason THRefcountedMapAllocator exists is exactly that the Linux kernel doesn't provide the same refcounting mechanism for shared memory objects that the Windows kernel does. That means using this allocator is pointless on Windows, and it could have been safely replaced with the regular THMapAllocator!

I still haven't seen any point against this in your argumentation. I'm not a Windows expert, so there might be caveats I'm unaware of, but if that's the case, please let me know!

@peterjc123 (Collaborator, Author)

@apaszke Thanks for your review. Yes, we could use THMapAllocator instead of THRefcountedMapAllocator to get rid of the refcount, but that is not actually the point. Something else blocks this, too: Python's multiprocessing library uses a Pipe to transfer data across processes on Windows, and during that transfer the FileMapping must not be closed. But the Pipe uses asynchronous I/O, so the correct moment to close the FileMapping is not easy to determine. (Insert code into multiprocessing?)
As an alternative, we can use THRefcountedMapAllocator to confirm that the I/O is actually finished when its refcount drops to zero. That is what this PR currently does.
Please let me know what you think.
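
To illustrate why the closing point is hard to pin down, here is a hedged sketch of an overlapped (asynchronous) pipe write in Win32; `hPipe`, `buf`, and `len` are hypothetical, and this shows the general overlapped I/O pattern, not code from Python's multiprocessing internals:

```c
#include <windows.h>

void send_async(HANDLE hPipe, const char *buf, DWORD len) {
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    /* An overlapped write returns immediately; the data (e.g. the
       FileMapping name) may not have reached the receiver yet. */
    BOOL ok = WriteFile(hPipe, buf, len, NULL, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        /* The transfer is still in flight, so the receiver may not have
           opened the FileMapping yet. Closing our mapping handle at this
           point could drop the kernel refcount to zero and destroy the
           shared memory before the receiver ever sees it. */
        DWORD written;
        GetOverlappedResult(hPipe, &ov, &written, TRUE);  /* wait for completion */
    }
    CloseHandle(ov.hEvent);
}
```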

@apaszke (Contributor) commented Mar 19, 2018

@peterjc123 OK, so the issue is that we only send the name of the shared object, and that doesn't cause its refcount to increase, so we can't close it until it has been opened in the other process. Do I understand this correctly?

Also, is file_system the only sharing strategy we support at the moment, or does file_descriptor work too?

@peterjc123 (Collaborator, Author)

@apaszke Yes, you are right. The problem is that we cannot get the handle (the ctx object may already be gone) and we don't know when to release it once the I/O completes.
I don't know exactly which strategy a FileMapping should count as. It uses memory rather than the hard disk (although backing a FileMapping with a file is also possible). file_descriptor works with fork, while FileMapping works with spawn: a file_descriptor is shared through a number (a handle), whereas FileMapping uses a common name to identify the same object.

@peterjc123 (Collaborator, Author)

@apaszke What do you think about this?

@apaszke (Contributor) left a review

Looks much better - RegisterWaitForSingleObject is super helpful in this context! The code could still be simplified a bit, but I think it's good to merge right after that's fixed.

My only concern with this method is that it keeps the tensors open in all processes for as long as they are alive in any of the processes. That might cause scripts to run out of free handles if someone has a lot of long-lived shared tensors. Still, it's a good enough fix for now, thanks a lot @peterjc123!
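
For reference, a minimal sketch of the RegisterWaitForSingleObject pattern being referred to: a thread-pool wait fires when the peer signals the companion event, and the callback drops this process's reference without tying up a dedicated thread. The `ReleaseCtx` struct and all names are illustrative assumptions, not the PR's actual code:

```c
#include <windows.h>
#include <stdlib.h>

typedef struct {
    HANDLE mapping;  /* our handle to the FileMapping         */
    HANDLE event;    /* companion event the peer signals      */
    HANDLE wait;     /* set by RegisterWaitForSingleObject    */
    void  *view;     /* mapped view of the storage            */
} ReleaseCtx;

static VOID CALLBACK on_peer_done(PVOID param, BOOLEAN timed_out) {
    ReleaseCtx *ctx = (ReleaseCtx *)param;
    (void)timed_out;  /* INFINITE timeout: we are only ever signaled */
    UnmapViewOfFile(ctx->view);
    CloseHandle(ctx->mapping);  /* drop our reference; the kernel frees the
                                   object once all processes have done so */
    CloseHandle(ctx->event);
    /* Note: the wait registration itself must be unregistered with
       UnregisterWaitEx from outside this callback to avoid deadlock. */
    free(ctx);
}

/* After sending the mapping's name over the pipe, register a one-shot,
   non-blocking wait on the companion event. */
void defer_release(ReleaseCtx *ctx) {
    RegisterWaitForSingleObject(&ctx->wait, ctx->event, on_peer_done,
                                ctx, INFINITE, WT_EXECUTEONLYONCE);
}
```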

@peterjc123 (Collaborator, Author)

@apaszke Done!

@apaszke merged commit ae4362b into pytorch:master on Mar 28, 2018
@apaszke (Contributor) commented Mar 28, 2018

Thank you @peterjc123!
