
[CI][CUDA] Re-enable the test-nan-assert on CUDA12 #154448

Closed
nWEIdia wants to merge 5 commits into pytorch:main from nWEIdia:main-reenable-test-nan-assert

Conversation


@nWEIdia nWEIdia commented May 27, 2025

We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.

I've already verified that the test hangs if we don't remove the pg._allgather_base(output, nan_tensor) call in between the backend._set_enable_nan_check calls.
Why was it "working" previously? Because previously only the cu118 distributed job was running, and this backend._set_enable_nan_check change was never exercised in the merge process (the skip logic is: if not CUDA 12 and above, skip).

Workaround #153479

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @kwen2501 @ptrblck @eqy @tinglvv @malfet @atalman

@pytorch-bot

pytorch-bot bot commented May 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154448

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs, 1 Unrelated Failure

As of commit 2f4cadb with merge base 16d05e1 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" and "topic: not user facing" labels May 27, 2025
@nWEIdia nWEIdia requested a review from kwen2501 May 27, 2025 19:31

@kwen2501 kwen2501 left a comment


Thanks!


nWEIdia commented May 27, 2025

Looks like this test would still hang, i.e. we cannot add the pg._allgather API call either.
https://github.com/pytorch/pytorch/actions/runs/15284040734/job/42991439449


nWEIdia commented May 27, 2025

Closing based on observations from #154448 (comment)

@nWEIdia nWEIdia closed this May 27, 2025

nWEIdia commented May 27, 2025

Re-opening to propose removing the API call between backend._set_enable_nan_check(False) and backend._set_enable_nan_check(True) in this test_nan_assert function.

@nWEIdia nWEIdia reopened this May 27, 2025
@nWEIdia nWEIdia requested a review from kwen2501 May 27, 2025 22:14
@nWEIdia nWEIdia changed the title WIP: [CI][CUDA] Re-enable the test-nan-assert on CUDA12 [CI][CUDA] Re-enable the test-nan-assert on CUDA12 May 27, 2025

@kwen2501 kwen2501 left a comment


Okay to remove the first all_gather


kwen2501 commented May 31, 2025

Due to the removal of the first all_gather, the NaN check runs early, before the main thread has launched the ncclAllGather kernel. At this point, the CUDA launch call detects a DSA and thus throws an exception (Cuda failure 'unspecified launch failure'). This differs from the previous behavior of a SIGABRT.

In that case, we can use with self.assertRaises to assert that this exception is thrown (instead of testing for SIGABRT). But that is still not reliable enough -- it is too sensitive to timing.

I wonder if we can funnel both behaviors into SIGABRT, by:

try:
    pg.all_gather(...)
except Exception:
    os.abort()  # raises SIGABRT (note: signal.abort does not exist)
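A hedged, POSIX-only demonstration of that funneling idea (not the PyTorch test itself): a child process converts any exception into an abort, so the parent observes a uniform SIGABRT termination regardless of which failure mode fired. The `RuntimeError` message is a stand-in for the CUDA failure described above.

```python
import signal
import subprocess
import sys

child_src = r"""
import os
try:
    raise RuntimeError("stand-in for: Cuda failure 'unspecified launch failure'")
except Exception:
    os.abort()  # terminate via SIGABRT
"""

proc = subprocess.run([sys.executable, "-c", child_src])
# subprocess encodes death-by-signal as a negative returncode
print(proc.returncode == -signal.SIGABRT)  # True on POSIX
```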


nWEIdia commented Jun 4, 2025

@pytorchbot rebase -b main

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot

Successfully rebased main-reenable-test-nan-assert onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout main-reenable-test-nan-assert && git pull --rebase)


nWEIdia commented Jun 4, 2025

So before the last change the error was:

Expected -6 but got 6.

Now I added a "-" sign to the 6, and the error becomes:
Expected -6 but got 250.

What is 250? -6 in 8-bit two's complement is 11111010 (binary), i.e. 250 (decimal).

So our error-code comparison is done on the raw 8-bit decimal value; perhaps we need some post-processing to map 250 back to -6.
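The 250-vs--6 arithmetic above can be reproduced directly: the OS keeps only the low 8 bits of an exit status, so exit(-6) is observed as 250 (0b11111010, the two's-complement byte for -6). A minimal sketch of the post-processing, with `to_signed_byte` as a hypothetical helper name:

```python
import subprocess
import sys

# Child exits with -6; the parent observes the truncated 8-bit status.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(-6)"])
raw = proc.returncode

def to_signed_byte(status):
    # Map the raw 8-bit exit status back to a signed value.
    return status - 256 if status >= 128 else status

print(raw, to_signed_byte(raw))  # 250 -6
```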


nWEIdia commented Jun 5, 2025

@pytorchbot merge -f "docker tag missing? does not seem related"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels

ci-no-td, Merged, oncall: distributed, open source, release notes: distributed (miscellaneous), topic: not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants