
Add isnan exit condition to special ops#157464

Closed
malfet wants to merge 8 commits into gh/malfet/427/base from gh/malfet/427/head

Conversation

@malfet
Contributor

@malfet malfet commented Jul 2, 2025

Stack from ghstack (oldest at bottom):

They might have been slow on CUDA-11.3, but that version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind:

for (int64_t k = 2; k <= n; k++) {
r = (x + x) * q - p;  // T_k(x) = 2x * T_{k-1}(x) - T_{k-2}(x), so O(n) iterations
p = q;
q = r;
}

which was exercised by test_compare_cpu using the following values (at sample index 16):
_large_float_vals = _large_float16_vals + (-4988429.2, 4988429.2, -1e20, 1e20)

Luckily, Chebyshev polynomials for absolute values greater than 1 quite quickly reach infinity; see below:

python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)

That is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7.
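The effect of the exit condition from the PR title can be sketched in pure Python (a minimal standalone model of the recurrence, not the actual ATen/CUDA kernel; `chebyshev_t_steps` is a hypothetical helper name). Once the recurrence overflows to inf, the next subtraction produces inf - inf = NaN, and every subsequent iteration stays NaN, so it is safe to bail out instead of spinning for up to n iterations:

```python
import math

def chebyshev_t_steps(x: float, n: int) -> tuple[float, int]:
    """Evaluate T_n(x) with the two-term recurrence, returning the value
    and the number of loop iterations actually executed.

    Bails out as soon as the recurrence hits NaN, since NaN is absorbing."""
    if n == 0:
        return 1.0, 0
    p, q = 1.0, x      # T_{k-2}(x), T_{k-1}(x)
    r = x
    steps = 0
    for _ in range(2, n + 1):
        r = (x + x) * q - p   # T_k(x) = 2x * T_{k-1}(x) - T_{k-2}(x)
        steps += 1
        if math.isnan(r):     # inf - inf from the diverged recurrence
            break
        p, q = q, r
    return r, steps

# Sanity check against a small closed form: T_3(x) = 4x^3 - 3x.
print(chebyshev_t_steps(0.5, 3)[0])   # -1.0

# For |x| > 1 the values diverge; with a huge n the early exit stops
# after a few thousand steps instead of a billion.
value, steps = chebyshev_t_steps(1.001, 10**9)
print(math.isnan(value), steps)
```

The bound on `steps` follows from T_n(x) = cosh(n * arccosh(x)) for x > 1: growth is exponential in n, so float64 overflows after roughly 700 / arccosh(x) iterations, and NaN appears a couple of steps later.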

Before

$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)

After

$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)

Fixes #79528

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157464

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 556cebd with merge base 0f9c1b3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@malfet malfet added the topic: not user facing topic category label Jul 2, 2025
[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 2, 2025
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 2, 2025
@malfet
Contributor Author

malfet commented Jul 2, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@malfet
Contributor Author

malfet commented Jul 2, 2025

@pytorchbot merge -f "Don't think binary builds will discover anything new"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jul 2, 2025
For eager and inductor

As with all other Chebyshev ops, the logic is simply compiled from https://github.com/pytorch/pytorch/blob/94716db22214912896cf680dc3eb88574f611a42/aten/src/ATen/native/cuda/Math.cuh#L2821

Pull Request resolved: #157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
@clee2000
Contributor

clee2000 commented Jul 3, 2025

@pytorchbot revert -m "caused slow test config to time out GH job link HUD commit link" -c nosignal

Looking at the logs I see lines like:

test_ops.py::TestCommonCUDA::test_compare_cpu_special_chebyshev_polynomial_v_cuda_float32 Command took >60min, returning 124
test_ops.py::TestCommonCUDA::test_compare_cpu_special_chebyshev_polynomial_t_cuda_float32 Command took >60min, returning 124

I'm not sure what's going on, but I also found that some unrelated tests take much longer after this change; maybe resource starvation due to running in parallel? I think they used to take <500s:

test_decomp.py::TestDecompCUDA::test_comprehensive_grid_sampler_2d_cuda_float16 PASSED [3255.6216s] [ 22%]
test_decomp.py::TestDecompCUDA::test_comprehensive_nn_functional_grid_sample_cuda_float64 PASSED [2497.9783s] [ 39%]

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@malfet your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 3, 2025
[ghstack-poisoned]
malfet added a commit that referenced this pull request Jul 3, 2025
They were slow on CUDA-11.3, which is long gone; let's see if they
work now

Fixes #79528

ghstack-source-id: 586b1fb
Pull Request resolved: #157464
@malfet
Contributor Author

malfet commented Jul 3, 2025

Wow, PYTORCH_TEST_WITH_SLOW=1 python3 test_ops.py -v -k test_compare_cpu_special_chebyshev_polynomial_v_cuda_float32 indeed takes forever...

[ghstack-poisoned]
@malfet malfet requested review from eqy and syed-ahmed as code owners July 4, 2025 04:05
[ghstack-poisoned]
[ghstack-poisoned]
@malfet malfet changed the title [BE] Unskip special ops Add isnan exit condition to special ops Jul 4, 2025
[ghstack-poisoned]
malfet added a commit that referenced this pull request Jul 4, 2025
They were slow on CUDA-11.3, which is long gone; let's see if they
work now

Fixes #79528

ghstack-source-id: 65c7302
Pull Request resolved: #157464
@malfet
Contributor Author

malfet commented Jul 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/malfet/427/head branch August 5, 2025 02:19