Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally #93192

Closed
xwang233 wants to merge 2 commits into pytorch:master from xwang233:c10-cuda-check-add-cudaGetLastError

xwang233 (Collaborator) commented Jan 28, 2023

Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally

This error was accidentally introduced by #92227, which was trying to fix #91758, an issue introduced in #85256.

The unit test TestCuda.test_events_multi_gpu_elapsed_time has been failing since that PR was merged (with CUDA 11.8 and CUDA 12.0). The test requires at least 2 GPUs, so it's probably not exercised in the OSS CI?

python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192

2023-01-27T19:41:32.2312162Z   test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)

Before #85256, the original C10_CUDA_CHECK had an extra cudaGetLastError call that captured those CUDA errors: https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra cudaGetLastError was originally introduced in #17337, as explained in this review comment: https://github.com/pytorch/pytorch/pull/17337/files#r259104503

soumith on Feb 21, 2019:
Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.
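To make the lingering-error problem concrete, here is a minimal sketch that does not use CUDA or PyTorch at all. It assumes the runtime records a failing call's error in a "last error" slot that only cudaGetLastError resets, and it mocks that slot with a plain global. Every name in it (MockError, mock_failing_call, check_with_clear, and so on) is hypothetical and exists only for illustration; it is not the code changed by this PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// MOCK of the runtime's "last error" slot. Nothing here is real CUDA or
// PyTorch code; all names are hypothetical.
enum MockError { kSuccess = 0, kInvalidResourceHandle = 1 };

static MockError g_last_error = kSuccess;

// Stands in for a runtime call that fails. The failure is both returned
// to the caller and recorded in the last-error slot (assumed behavior).
MockError mock_failing_call() {
  g_last_error = kInvalidResourceHandle;
  return g_last_error;
}

// Mimics cudaGetLastError: return the recorded error, then reset the slot.
MockError mock_get_last_error() {
  MockError e = g_last_error;
  g_last_error = kSuccess;
  return e;
}

// A check that only looks at the return value and leaves the slot dirty.
void check_without_clear(MockError e) {
  if (e != kSuccess) {
    throw std::runtime_error("mock error " + std::to_string(static_cast<int>(e)));
  }
}

// A check that drains the recorded error before throwing.
void check_with_clear(MockError e) {
  if (e == kSuccess) return;
  (void)mock_get_last_error();  // clear the lingering error
  throw std::runtime_error("mock error " + std::to_string(static_cast<int>(e)));
}

// Stands in for a later, unrelated check that consults the last-error slot.
bool later_check_passes() { return mock_get_last_error() == kSuccess; }

// Runs one failing call through the chosen check, swallows the expected
// exception, and reports whether the later unrelated check still passes.
bool later_check_passes_after_failure(bool clear) {
  g_last_error = kSuccess;
  try {
    MockError e = mock_failing_call();
    if (clear) {
      check_with_clear(e);
    } else {
      check_without_clear(e);
    }
  } catch (const std::runtime_error&) {
    // The caller expected this failure and handled it.
  }
  return later_check_passes();
}
```

In this mock, later_check_passes_after_failure(false) reports a spurious failure in the later check, while later_check_passes_after_failure(true) does not. That is the same shape as the test failure described above: the test expects and handles one error, and a subsequent check trips over it again.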

@xwang233 xwang233 requested review from ngimel and ptrblck January 28, 2023 01:56
pytorch-bot bot commented Jan 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93192

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1ea494e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xwang233 (Collaborator, Author) commented:

cc @ezyang @r-barnes

ezyang (Contributor) left a comment:

@r-barnes ptal

xwang233 (Collaborator, Author) commented:

@pytorchbot merge -g

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 28, 2023
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).



Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source
