
[CI][CUDA] Re-enable the test-nan-assert on CUDA12 #154448

Closed
nWEIdia wants to merge 5 commits into pytorch:main from nWEIdia:main-reenable-test-nan-assert

Conversation


@nWEIdia nWEIdia commented May 27, 2025

We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.

I've already verified that the test hangs if we don't remove the pg._allgather_base(output, nan_tensor) call in between the backend._set_enable_nan_check calls.
Why was it "working" previously? Because previously only the cu118 distributed job was running, and this backend._set_enable_nan_check change was never exercised in the merge process (the skip logic is: if not CUDA 12 and above, skip).

Workaround #153479

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @kwen2501 @ptrblck @eqy @tinglvv @malfet @atalman

@pytorch-bot

pytorch-bot bot commented May 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154448

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs, 1 Unrelated Failure

As of commit 2f4cadb with merge base 16d05e1 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" and "topic: not user facing" labels May 27, 2025
@nWEIdia nWEIdia requested a review from kwen2501 May 27, 2025 19:31

@kwen2501 kwen2501 left a comment


Thanks!


nWEIdia commented May 27, 2025

Looks like this test would still hang, i.e. we cannot add the pg._allgather API call either.
https://github.com/pytorch/pytorch/actions/runs/15284040734/job/42991439449


nWEIdia commented May 27, 2025

Closing based on observations from #154448 (comment)

@nWEIdia nWEIdia closed this May 27, 2025

nWEIdia commented May 27, 2025

Re-opening to propose removing the API call between backend._set_enable_nan_check(False) and backend._set_enable_nan_check(True) in this test_nan_assert function.

@nWEIdia nWEIdia reopened this May 27, 2025
@nWEIdia nWEIdia requested a review from kwen2501 May 27, 2025 22:14
@nWEIdia nWEIdia changed the title WIP: [CI][CUDA] Re-enable the test-nan-assert on CUDA12 [CI][CUDA] Re-enable the test-nan-assert on CUDA12 May 27, 2025

@kwen2501 kwen2501 left a comment


Okay to remove the first all_gather


kwen2501 commented May 31, 2025

Due to the removal of the first all_gather, the NaN check runs early, before the main thread has launched the ncclAllGather kernel. At this point, the CUDA launch call detects a DSA and thus throws an exception (Cuda failure 'unspecified launch failure'). This differs from the previous behavior of a SIGABRT.

In that case, we can use with self.assertRaises to assert that this exception is thrown (instead of testing for SIGABRT). But that is still not reliable enough -- it is too sensitive to timing.

I wonder if we can funnel both behaviors into SIGABRT, by:

try:
    pg.all_gather(...)
except Exception:
    os.abort()  # raises SIGABRT (note: signal.abort does not exist)
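A hedged, POSIX-only demonstration of that funneling idea (not the PyTorch test itself): a child process converts any exception into an abort, so the parent observes a uniform SIGABRT termination regardless of which failure mode fired. The `RuntimeError` message is a stand-in for the CUDA failure described above.

```python
import signal
import subprocess
import sys

child_src = r"""
import os
try:
    raise RuntimeError("stand-in for: Cuda failure 'unspecified launch failure'")
except Exception:
    os.abort()  # terminate via SIGABRT
"""

proc = subprocess.run([sys.executable, "-c", child_src])
# subprocess encodes death-by-signal as a negative returncode
print(proc.returncode == -signal.SIGABRT)  # True on POSIX
```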


nWEIdia commented Jun 4, 2025

@pytorchbot rebase -b main

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot

Successfully rebased main-reenable-test-nan-assert onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout main-reenable-test-nan-assert && git pull --rebase)


nWEIdia commented Jun 4, 2025

So before the last change the error was:

Expected -6 but got 6.

Now I added a "-" sign to the 6, and the error becomes:
Expected -6 but got 250.

What is 250? -6 in 8-bit two's complement is 11111010 (binary), i.e. 250 (decimal).

So our error-code comparison is done on the raw 8-bit decimal value; perhaps we need some post-processing to map 250 back to -6.
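The 250-vs--6 arithmetic above can be reproduced directly: the OS keeps only the low 8 bits of an exit status, so exit(-6) is observed as 250 (0b11111010, the two's-complement byte for -6). A minimal sketch of the post-processing, with `to_signed_byte` as a hypothetical helper name:

```python
import subprocess
import sys

# Child exits with -6; the parent observes the truncated 8-bit status.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(-6)"])
raw = proc.returncode

def to_signed_byte(status):
    # Map the raw 8-bit exit status back to a signed value.
    return status - 256 if status >= 128 else status

print(raw, to_signed_byte(raw))  # 250 -6
```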


nWEIdia commented Jun 5, 2025

@pytorchbot merge -f "docker tag missing? does not seem related"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels

ci-no-td, Merged, oncall: distributed, open source, release notes: distributed (miscellaneous), topic: not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants