bypass getDeviceFromPtr check when device is known #36714

emcastillo wants to merge 2 commits into pytorch:master

Conversation
💊 CI failures summary and remediations

As of commit f7bcd98 (more details on the Dr. CI page):

✅ None of the CI failures appear to be your fault 💚
❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:
THCudaInit

emcastillo force-pushed from 06c1d6d to ea61b6c
@emcastillo Can you please rebase? PRs that are too old can't be merged.
emcastillo force-pushed from ea61b6c to f7bcd98
Rebased! Thanks 😊
facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Fixes pytorch#36594 (see the PR description below).
Pull Request resolved: pytorch#36714
Differential Revision: D21504557
Pulled By: ngimel
fbshipit-source-id: 173ccdeb7c2a2b0ece53dd50be97f2df577a5634
Fixes #36594
In some cases, when memory allocated in another process is used before any memory-related operation has run in PyTorch, errors occur because the GPU CUDA context is not completely initialized.
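To make the failure mode concrete, here is a minimal, hedged repro sketch of that pattern (the exact trigger in #36594 may differ): a consumer process receives a CUDA tensor backed by memory the producer allocated, before the consumer has made any CUDA call of its own.

```python
# Hypothetical repro sketch, not the exact report from #36594: a consumer
# process receives CUDA memory allocated in the producer before it has run
# any CUDA operation, so its CUDA context is not yet initialized when the
# shared storage is rebuilt.
import torch
import torch.multiprocessing as mp

def consumer(q):
    # First CUDA-related call in this process: before this PR, rebuilding
    # the shared storage could fail inside getDeviceFromPtr because no
    # CUDA context existed yet in the consumer.
    t = q.get()
    print(t.sum())  # accessing the tensor initializes the context lazily

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    shared = torch.ones(4, device="cuda:0")  # allocated in the producer
    q.put(shared)
    p.join()  # keep `shared` alive until the consumer is done with it
```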
I guess there is an explicit reason to leave the context uninitialized at first and not to set it up in `THCudaInit`, where other CUDA calls already happen. I'd like to discuss that in this PR.
Possible better solutions:

- Initialize the device context in `fromDLPack` or `from_blob`, probably by creating some dummy array with one element (see the sketch after this list). But this feels like a hack.
- Catch the exception in `getDeviceFromPtr`, check whether the context was initialized, and if not, repeat the operation. But we would need to check every device.
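As a point of reference for the first alternative, the "dummy array" trick is already available from user code; `torch.cuda.init()` is the documented way to force the same initialization:

```python
# Sketch of the "dummy one-element array" workaround mentioned above,
# applied from user code rather than inside fromDLPack/from_blob.
import torch

torch.cuda.init()                    # explicit: force lazy CUDA init to run
# or, equivalently, allocate a dummy one-element tensor on the device:
_ = torch.empty(1, device="cuda:0")
```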
This PR bypasses the `getDeviceFromPtr` call, which is the one causing the problem, when we already know the device. This allows us to create the Tensor from the shared-memory storage without initializing the context; the context will be initialized when the tensor is accessed later.
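For clarity, a conceptual sketch of the bypass in Python pseudocode (the real change is in the C++ code paths; `get_device_from_ptr` below is a stand-in for ATen's `getDeviceFromPtr`, which, as far as I know, probes the pointer via `cudaPointerGetAttributes`):

```python
# Conceptual sketch only, not the actual PyTorch source.
def get_device_from_ptr(ptr):
    # Stand-in for ATen's getDeviceFromPtr: probing the pointer needs an
    # initialized CUDA context and raises otherwise.
    raise RuntimeError("CUDA error: context not initialized")

def resolve_device(ptr, known_device=None):
    if known_device is not None:
        return known_device           # device known: skip the probe entirely
    return get_device_from_ptr(ptr)   # fallback: probe the pointer

# With the device already known (e.g., from DLPack metadata or the IPC
# handle), no CUDA call is needed, so no context is required yet:
print(resolve_device(0xDEAD, known_device="cuda:0"))
```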