Move stream to thread local #13080
Conversation
Summary: This is the first step to untangle this logic:
- moves the stream id to a thread local mechanically
- relies on the fact that the thread-local value is valid in conjunction with a CUDAContext only until the next SwitchToDevice call
- we should move to proper RAII in the following diffs

Follow-up diffs are going to move more state out of CUDAContext (by making gpu_id thread local too) and simplify CopyFrom.

The only expected change in behavior: CopyFrom used to copy on stream logical id 0 if the context was created on the fly; now it copies on the current stream. Since it blocks explicitly either way, I don't think it matters much.

Also, observers were semi-broken by waiting on a potentially wrong stream. That can be fixed later; I renamed the method to avoid abuse.

Differential Revision: D10525134
fbshipit-source-id: 7810c458cc6beef94f255fe11a044252cc38f4f9
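To make the mechanics concrete, here is a minimal sketch of the pattern this diff introduces. This is not the actual Caffe2 source; the name `current_stream_id_tls`, the member layout, and the constructor signature are assumptions:

```cpp
#include <cuda_runtime.h>

namespace caffe2 {

// The logical stream id now lives in a thread_local rather than in a
// CUDAContext member, so it is per-thread state shared by all contexts
// on that thread.
static thread_local int current_stream_id_tls = 0;

class CUDAContext {
 public:
  explicit CUDAContext(int gpu_id = 0) : gpu_id_(gpu_id) {}

  // SwitchToDevice is the only writer: as the summary notes, the
  // thread_local value is only meaningful in conjunction with this
  // context until the next SwitchToDevice call (hence the plan to
  // move to proper RAII in follow-up diffs).
  void SwitchToDevice(int stream_id) {
    current_stream_id_tls = stream_id;
    cudaSetDevice(gpu_id_);
  }

  // Readers resolve the stream id through the thread_local on each call.
  int stream_id() const { return current_stream_id_tls; }

 private:
  int gpu_id_;
};

} // namespace caffe2
```

The design tradeoff this sketch exposes is exactly the one the summary flags: per-thread state is cheap and mechanical to introduce, but it silently couples every context on the thread, which is what proper RAII scoping would later fix.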
ezyang left a comment:
Thank you, this is very helpful.
This diff broke the Caffe2 ROCm tests: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/629//console. It might just be a failure to apply this code change to the HIP code as well.
@ezyang Yep, the context_gpu code change needs to be applied to context_hip as well (the test is automatically hipified, but the context_* source code is not). I'm working on hipifying caffe2/core in #13148; I just rebased my PR to include this change and triggered a test run here: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-trigger-test/3525/. Let's see if the failure goes away. New run with #gpu checks: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-trigger-test/3609/
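For illustration, a hedged sketch of what mirroring the change into context_hip could look like (identifiers assumed, not the actual context_hip source; the point is that this file is maintained by hand, so the automatic hipification of the tests alone cannot pick up the change):

```cpp
#include <hip/hip_runtime.h>

namespace caffe2 {

// Same pattern as in context_gpu: the logical stream id becomes
// per-thread state instead of a HIPContext member.
static thread_local int hip_stream_id_tls = 0;

class HIPContext {
 public:
  explicit HIPContext(int gpu_id = 0) : gpu_id_(gpu_id) {}

  void SwitchToDevice(int stream_id) {
    hip_stream_id_tls = stream_id;
    hipSetDevice(gpu_id_);  // HIP analogue of cudaSetDevice
  }

  int stream_id() const { return hip_stream_id_tls; }

 private:
  int gpu_id_;
};

} // namespace caffe2
```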
Two review comments on the following lines of the test diff were marked as off-topic; the excerpt they referenced:

    auto before_stream = context_outer.cuda_stream();
    ...
    // try to mess up current device
    CUDAContext context_different_device(1);
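For context, a hypothetical reconstruction of the kind of check these lines belong to (gtest-style shape assumed; this is not the actual test from the diff): the stream recorded through one context should survive another context being constructed on a different device:

```cpp
#include <gtest/gtest.h>
#include "caffe2/core/context_gpu.h"

TEST(CUDAContextTest, StreamSurvivesOtherDeviceContext) {
  caffe2::CUDAContext context_outer(0);
  context_outer.SwitchToDevice(1);  // pick a non-default logical stream

  auto before_stream = context_outer.cuda_stream();

  // try to mess up current device
  caffe2::CUDAContext context_different_device(1);

  // With the stream id held in a thread_local, reading it back through
  // the outer context should still yield the stream recorded above.
  EXPECT_EQ(before_stream, context_outer.cuda_stream());
}
```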