
Add isnan exit condition to special ops#157464

Closed
malfet wants to merge 8 commits into gh/malfet/427/base from gh/malfet/427/head

Conversation

@malfet
Contributor

@malfet malfet commented Jul 2, 2025

Stack from ghstack (oldest at bottom):

They might have been slow on CUDA-11.3, but that version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind:

for (int64_t k = 2; k <= n; k++) {
r = (x + x) * q - p;  // T_k(x) = 2x * T_{k-1}(x) - T_{k-2}(x), so O(n) iterations
p = q;
q = r;
}

which was exercised by test_compare_cpu using the following values (at sample index 16):
_large_float_vals = _large_float16_vals + (-4988429.2, 4988429.2, -1e20, 1e20)

Luckily, Chebyshev polynomials for absolute values greater than 1 quite quickly reach infinity; see below:

python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)

That is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7.
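The effect of the exit condition from the PR title can be sketched in pure Python (a minimal standalone model of the recurrence, not the actual ATen/CUDA kernel; `chebyshev_t_steps` is a hypothetical helper name). Once the recurrence overflows to inf, the next subtraction produces inf - inf = NaN, and every subsequent iteration stays NaN, so it is safe to bail out instead of spinning for up to n iterations:

```python
import math

def chebyshev_t_steps(x: float, n: int) -> tuple[float, int]:
    """Evaluate T_n(x) with the two-term recurrence, returning the value
    and the number of loop iterations actually executed.

    Bails out as soon as the recurrence hits NaN, since NaN is absorbing."""
    if n == 0:
        return 1.0, 0
    p, q = 1.0, x      # T_{k-2}(x), T_{k-1}(x)
    r = x
    steps = 0
    for _ in range(2, n + 1):
        r = (x + x) * q - p   # T_k(x) = 2x * T_{k-1}(x) - T_{k-2}(x)
        steps += 1
        if math.isnan(r):     # inf - inf from the diverged recurrence
            break
        p, q = q, r
    return r, steps

# Sanity check against a small closed form: T_3(x) = 4x^3 - 3x.
print(chebyshev_t_steps(0.5, 3)[0])   # -1.0

# For |x| > 1 the values diverge; with a huge n the early exit stops
# after a few thousand steps instead of a billion.
value, steps = chebyshev_t_steps(1.001, 10**9)
print(math.isnan(value), steps)
```

The bound on `steps` follows from T_n(x) = cosh(n * arccosh(x)) for x > 1: growth is exponential in n, so float64 overflows after roughly 700 / arccosh(x) iterations, and NaN appears a couple of steps later.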

Before

$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)

After

$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)

Fixes #79528

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157464

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 556cebd with merge base 0f9c1b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@malfet malfet added the topic: not user facing topic category label Jul 2, 2025
[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 2, 2025
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 2, 2025
@malfet
Contributor Author

malfet commented Jul 2, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@malfet
Contributor Author

malfet commented Jul 2, 2025

@pytorchbot merge -f "Don't think binary builds will discover anything new"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2025
For eager and inductor

As with all other Chebyshev ops, the logic is simply compiled from https://github.com/pytorch/pytorch/blob/94716db22214912896cf680dc3eb88574f611a42/aten/src/ATen/native/cuda/Math.cuh#L2821

Pull Request resolved: #157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
@clee2000
Contributor

clee2000 commented Jul 3, 2025

@pytorchbot revert -m "caused slow test config to time out GH job link HUD commit link" -c nosignal

Looking at the logs I see lines like:

test_ops.py::TestCommonCUDA::test_compare_cpu_special_chebyshev_polynomial_v_cuda_float32 Command took >60min, returning 124
test_ops.py::TestCommonCUDA::test_compare_cpu_special_chebyshev_polynomial_t_cuda_float32 Command took >60min, returning 124

I'm not sure what's going on, but I also found that some unrelated tests take much longer after this change; maybe resource starvation due to running in parallel? I think they used to take <500s:

test_decomp.py::TestDecompCUDA::test_comprehensive_grid_sampler_2d_cuda_float16 PASSED [3255.6216s] [ 22%]
test_decomp.py::TestDecompCUDA::test_comprehensive_nn_functional_grid_sample_cuda_float64 PASSED [2497.9783s] [ 39%]

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@malfet your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 3, 2025
[ghstack-poisoned]
malfet added a commit that referenced this pull request Jul 3, 2025
They were slow on CUDA-11.3, which is long gone; let's see if they
work now

Fixes #79528

ghstack-source-id: 586b1fb
Pull Request resolved: #157464
@malfet
Contributor Author

malfet commented Jul 3, 2025

Wow, PYTORCH_TEST_WITH_SLOW=1 python3 test_ops.py -v -k test_compare_cpu_special_chebyshev_polynomial_v_cuda_float32 indeed takes forever...

[ghstack-poisoned]
@malfet malfet requested review from eqy and syed-ahmed as code owners July 4, 2025 04:05
[ghstack-poisoned]
[ghstack-poisoned]
@malfet malfet changed the title [BE] Unskip special ops Add isnan exit condition to special ops Jul 4, 2025
[ghstack-poisoned]
malfet added a commit that referenced this pull request Jul 4, 2025
They were slow on CUDA-11.3, which is long gone; let's see if they
work now

Fixes #79528

ghstack-source-id: 65c7302
Pull Request resolved: #157464
@malfet
Contributor Author

malfet commented Jul 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/malfet/427/head branch August 5, 2025 02:19