Fix CUDA error not getting captured by handler #92227
r-barnes wants to merge 2 commits into pytorch:master
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/92227.
Note: links to docs will display an error until the docs builds have been completed. ❌ 2 failures (new) as of commit 355e925.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / macos-12-py3-arm64-mps / Run MPS tests. Raised by workflow job (details for the Dev Infra team).
@pytorchbot merge -f "MacOS failures are clearly unrelated to CUDA-specific change"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)

This error was accidentally introduced by #92227, which was trying to fix #91758 (itself introduced in #85256). The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR was merged (on cuda 11.8 and cuda 12.0). That test requires >= 2 GPUs, so it's probably not tested in the OSS CI?

```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192:

```
2023-01-27T19:41:32.2312162Z test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 had an extra `cudaGetLastError` that captures those cuda errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented at https://github.com/pytorch/pytorch/pull/17337/files#r259104503:

> soumith on Feb 21, 2019: Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.

Pull Request resolved: #93192
Approved by: https://github.com/ezyang
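For illustration, here is a minimal standalone repro sketch (hypothetical, not PyTorch code) of the lingering-error behavior described in the quoted comment: `cudaGetLastError` both returns and clears the thread-local error state, so a check macro that never calls it leaves a stale error behind that the next check falsely attributes to an unrelated call.

```
// Hypothetical repro sketch (compile with `nvcc repro.cu`); assumes a
// CUDA-capable device is present.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}

int main() {
  // An invalid launch configuration (4096 > 1024 threads per block) does
  // not fail at the call site; it only sets the thread-local last error.
  noop<<<1, 4096>>>();

  // cudaGetLastError() returns the pending error *and resets it to
  // cudaSuccess*, which is the behavior the pre-#85256 check relied on.
  printf("first check:  %s\n", cudaGetErrorString(cudaGetLastError()));

  // Because the previous call cleared the state, an unrelated subsequent
  // check is no longer falsely triggered.
  printf("second check: %s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```

On a healthy device the first check prints the invalid-configuration error and the second prints "no error", showing the state was drained by the first query.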
Fixes #91758. Still leaves functions on the hot path.
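As a rough sketch of the pattern the fix preserves (this is not the actual `C10_CUDA_CHECK` macro; the name and error handling below are simplified assumptions): when a call fails, the handler also drains the thread-local error state with `cudaGetLastError()` so the failure is reported once, here, instead of lingering into the next unrelated check.

```
// Simplified sketch of a C10_CUDA_CHECK-style macro; not PyTorch's
// actual implementation.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(expr)                                               \
  do {                                                                 \
    const cudaError_t err_ = (expr);                                   \
    if (err_ != cudaSuccess) {                                         \
      /* Drain the thread-local error state so this failure is not */  \
      /* falsely re-reported by a subsequent, unrelated check.     */  \
      (void)cudaGetLastError();                                        \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
              cudaGetErrorString(err_));                               \
      abort();                                                         \
    }                                                                  \
  } while (0)

int main() {
  CHECK_CUDA(cudaSetDevice(0));   // ordinary checked call
  CHECK_CUDA(cudaFree(nullptr));  // freeing nullptr is a successful no-op
  return 0;
}
```

The `do { ... } while (0)` wrapper just keeps the macro usable as a single statement; the line relevant to this PR is the `cudaGetLastError()` call in the failure path.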