Fix memory leak when using multiple workers on Windows#5585
Fix memory leak when using multiple workers on Windows#5585apaszke merged 10 commits intopytorch:masterfrom
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Great! Would this also solve the cuda OOM? |
|
@ssnl I don't think it's related. Since it's about shared memory but CUDA Tensors are not shared on Windows. |
|
This PR is almost finished. However, there're some racing conditions when workers shutdown in DataLoader. Traceback listed as error type 2. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
apaszke
left a comment
There was a problem hiding this comment.
Not sure what kind of failures is this solving. I think it will still be true that only a single process will close the file (the one that originally opened it), right?
Also, if the only process that can actually free the file is the one that created it, then we will have other problems. E.f what if we'll have children opening mappings and sending them to that main process, and dying before the main proc sets the event. What happens then?
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@ssnl Sorry for wrong answer. I think they may be related now. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@apaszke It's my fault for not having explained it clearly. |
7a3646c to
7e4d852
Compare
|
@apaszke The code was rewritten to use only one daemon thread for a process. I think it should be okay now. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@apaszke Please remember to review this when you are free. |
apaszke
left a comment
There was a problem hiding this comment.
Thanks for the explanation. I think that if this is the case, then basically THRefcountedMapAllocator should behave very similarly like THMapAllocator (it might be enough to make it an alias? not sure what's the semantics of deleting the file while someone else has it open in Windows), because the refcounting is implemented by the Windows kernel, and the file will be alive as long as any process has it open. The only reason why I introduced the refcounted allocator was because there's no way to force POSIX systems to give us the same behavior, and I just had to emulate it with a manual refcount. This would be both much simpler and more efficient.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@apaszke Could you review this again please? |
|
While your changes could have addressed my inline comments in the code (haven't checked yet), I still can't see the reason why we have to add 300 lines of multithreaded code to handle a problem that seems inexistent on Windows. I still think this is not the right approach, and it could be greatly simplified. |
|
@apaszke ExplanationThe release for the FileMapping is described in this MSDN document. CloseHandle(handle);
UnmapViewOfFile(data);However, currently we do it using refcount. That is if (THAtomicDecrementRef(&info->refcount)) {
CloseHandle(ctx->handle);
}
UnmapViewOfFile(data);The refcount is the global count for the all the processes that hold About the simplified solutionYes, you may argue that this implementation is not good enough. However, this issue is really existent. The memory are actually eaten up when you use multiple workers. One way to solve it more elegantly may be to delay the |
5e506c2 to
2a88fbc
Compare
|
@peterjc123 thanks a lot for your detailed write up. This states the problem very clearly. Still, this reassures me that I understood what you said before correctly, and I stand by my hypothesis. I agree that the current behavior (of closing only when refcount reaches zero) is incorrect! My only point is that what you wrote in this patch is unnecessary and could be greatly simplified. The reason I still haven't seen any points against this in your argumentation. I'm not a Windows expert so there might still be some caveats I'm unaware of, but if this is the case then please let me know! |
|
@apaszke Thanks for your review. Yes, we can use |
|
@peterjc123 Ok, so the issue is that we're only sending the name of the shared object in this file, and this doesn't cause its refcount to increase, so we can't close it until it is opened in another process. Do I understand this correctly? Also, is |
|
@apaszke Yes, you are right. The problem is that we cannot get the handle(because the ctx object may be already gone) and don't know when to release it when IO completes. |
|
@apaszke What's your idea about this? |
apaszke
left a comment
There was a problem hiding this comment.
Looks much better - RegisterWaitForSingleObject is super helpful in this context! The code could be still simplified a bit, but I think it's good to merge right after that's fixed.
My only concern with this method is that it will keep the tensors open in all processes for as long as they are alive in any of the processes. It might cause the scripts to run out of free handles if someone has a lot of long-lived shared tensors. Still, it's a good enough fix for now, thanks a lot @peterjc123!
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@apaszke Done! |
|
Thank you @peterjc123! |
The memory leak is caused by the difference in using FileMapping(mmap) on Windows. On Windows, FileMapping objects should be closed by all related processes and then it can be released. And there's no other way to explicitly delete it.(Like shm_unlink)
When multiprocessing is on, the child process will create a FileMapping and then the main process will open it. After that, at some time, the child process will try to release it but it's reference count is non-zero so it cannot be released at that time. But the current code does not provide a chance to let it close again when possible.
This PR targets #5590.
Current Progress:
The memory leak when num_worker=1 should be solved. However, further work has to be done for more workers.
Error type 1(unrelated filemapping handle get killed):
Error type 2(unrelated event handle get killed):