Make CUDA exceptions unlikely and isolate C10_CUDA_CHECK body #85256

r-barnes wants to merge 1 commit into pytorch:master
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85256
✅ No Failures, 4 Pending as of commit a5797e6 (Dr. CI)
This pull request was exported from Phabricator. Differential Revision: D39619861
Force-pushed from df3ba51 to a18dd98
Force-pushed from a18dd98 to 57e69a2
Make CUDA exceptions unlikely and isolate C10_CUDA_CHECK body (pytorch#85256)

Summary: Pull Request resolved: pytorch#85256
Test Plan: Sandcastle
Differential Revision: D39619861
fbshipit-source-id: 8910ada87d8d0fc95e7dfcf8d6b6eb2f90e2625b
Force-pushed from 3b9dd5e to a5797e6
```cmake
set(C10_CUDA_HEADERS
    CUDAException.h
    CUDAFunctions.h
```
Just alphabetizing these since I'm here.
```cmake
set(C10_CUDA_SRCS
    CUDAStream.cpp
    CUDACachingAllocator.cpp
    CUDAException.cpp
```
Add this, alphabetize the rest.
@pytorchbot merge -f "test failure unrelated"
@pytorchbot successfully started a merge job. Check the current status here.
Hey @r-barnes. |
This marks CUDA exception checks as unlikely, which might have a positive performance impact. It further isolates part of `C10_CUDA_CHECK` into a separate function and file so that code can be made more expressive in subsequent diffs without bloating functions that use the check or creating readability issues.

Test Plan: Sandcastle
Differential Revision: D39619861
Pull Request resolved: #85256
Approved by: https://github.com/ezyang, https://github.com/ngimel
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)

This error was accidentally introduced by #92227, which was trying to fix #91758 as introduced in #85256.

The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on CUDA 11.8 and CUDA 12.0). That test requires >= 2 GPUs, so it is probably not exercised in the OSS CI:

```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

For example, in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192:

```
2023-01-27T19:41:32.2312162Z test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captures those cuda errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented at https://github.com/pytorch/pytorch/pull/17337/files#r259104503:

> soumith on Feb 21, 2019: Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.

Pull Request resolved: #93192
Approved by: https://github.com/ezyang