[CI][CUDA] Re-enable the test-nan-assert on CUDA12 #154448
nWEIdia wants to merge 5 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154448
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Cancelled Jobs, 1 Unrelated Failure as of commit 2f4cadb with merge base 16d05e1.
NEW FAILURE - The following job has failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks like this test would still hang, i.e., we cannot add the pg._allgather API call either.
Closing based on observations from #154448 (comment)
Re-opening to propose removing the collective call between backend._set_enable_nan_check(False) and backend._set_enable_nan_check(True) in this test_nan_assert function.
kwen2501 left a comment:
Okay to remove the first all_gather
Due to the removal of the first all_gather, the NaN check runs early: the main thread has yet to launch the ncclAllGather kernel. At this point, the CUDA launch call senses a DSA (device-side assert) and thus throws an exception (…). In that case, we can use (…). I wonder if we can funnel both behaviors into SIGABRT, by: (…)
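For context on the two failure modes being compared, here is a minimal sketch, not tied to the PR's test harness, assuming a POSIX system: a child process killed by SIGABRT is reported by Python's subprocess with returncode -6 (the negated signal number), while a process that calls sys.exit(6) reports 6.

```python
import subprocess
import sys

# Child 1: dies from SIGABRT (os.abort raises signal 6 on POSIX).
aborted = subprocess.run([sys.executable, "-c", "import os; os.abort()"])

# Child 2: exits normally with status 6 via sys.exit.
exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(6)"])

# On POSIX, a death-by-signal is reported as the negated signal number,
# while a normal exit is reported as the exit status itself.
print(aborted.returncode)  # -6 (SIGABRT)
print(exited.returncode)   # 6
```

This sign difference is exactly what "funnel both behaviors into SIGABRT" would eliminate: both paths would then be observed as -6 by the parent.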
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased |
So before the last change the error was: Expected -6 but got 6. Then I added a "-" sign to the 6 (i.e., sys.exit(-6)), and the error instead reports 250. What is 250? -6 in 8-bit two's complement is 11111010 in binary, which is 250 in decimal. So our error-code comparison is in decimal; perhaps we need some post-processing to map 250 back to -6.
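The 250 can be reproduced directly. A minimal sketch, assuming a POSIX system, where the exit status passed to sys.exit is truncated to its low 8 bits by the OS:

```python
import subprocess
import sys

# A child that calls sys.exit(-6): the OS keeps only the low 8 bits
# of the exit status, and -6 & 0xFF == 250.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(-6)"])
print(child.returncode)  # 250 on POSIX
print(-6 & 0xFF)         # 250: the two's-complement reinterpretation
```

So a test comparing the parent-observed returncode against -6 can never pass when the child calls sys.exit(-6); the child either has to exit with a positive status or die from an actual signal.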
Change the sys.exit value back to 6. So hopefully this time it will pass.
@pytorchbot merge -f "docker tag missing? does not seem related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use (…). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.
I've already tested that there would be a hang if we don't remove the pg._allgather_base(output, nan_tensor) call in between the backend._set_enable_nan_check calls.
Why was it "working" previously? Because previously only the cu118 distributed job was running, so this backend._set_enable_nan_check change was never exercised during merging (the skip logic is: if not "CUDA 12 and above", skip).
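The skip condition described above can be sketched as follows. This is a hypothetical helper for illustration, not the actual decorator used in PyTorch's distributed test suite:

```python
def should_skip_nan_assert_test(cuda_version):
    """Skip unless the CUDA toolkit is version 12 or above.

    cuda_version is a (major, minor) tuple, e.g. (11, 8) or (12, 1).
    This mirrors the "if not CUDA 12 and above, skip" logic: under
    cu118 the test is skipped, which is why the nan-check change
    went untested while only the cu118 distributed job was running.
    """
    return cuda_version < (12, 0)

print(should_skip_nan_assert_test((11, 8)))  # True: skipped on cu118
print(should_skip_nan_assert_test((12, 1)))  # False: runs on CUDA 12+
```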
Workaround #153479
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @kwen2501 @ptrblck @eqy @tinglvv @malfet @atalman