Add meta tensor support for _amp_foreach_non_finite_check_and_unscale_ and nan_to_num #94633

Closed

wonjoo-wj wants to merge 3 commits into main from meta-tensor

Conversation

@wonjoo-wj
Collaborator

@wonjoo-wj wonjoo-wj commented Feb 10, 2023

Fixes #92916 ([Functionalization] Some ops need additional meta tensor support after functionalization)


Add meta tensor support for _amp_foreach_non_finite_check_and_unscale_ and nan_to_num

cc @alanwaketan
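
For context, a minimal Python-side sketch of what a Meta registration for the amp op could look like (a hypothetical sketch; the PR's actual registration may be done differently, e.g. in ATen):

import torch
from torch.library import Library

# Hypothetical sketch: attach a kernel for the existing aten op to the Meta
# dispatch key from Python. Meta tensors carry only metadata (shape/dtype),
# so an in-place op that returns nothing has no work to do here.
meta_lib = Library("aten", "IMPL", "Meta")

def amp_check_and_unscale_meta(grads, found_inf, inv_scale):
    # Mutates `grads` (a tensor list) and `found_inf` in place on real
    # backends; on Meta, shapes and dtypes are unchanged, so this is a no-op.
    pass

meta_lib.impl("_amp_foreach_non_finite_check_and_unscale_", amp_check_and_unscale_meta)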

@wonjoo-wj
Collaborator Author

@bdhirsh, even with these changes cherry-picked onto my PyTorch functionalization branch, I still see the same errors:

/opt/conda/lib/python3.8/site-packages/torch/_functorch/deprecated.py:93: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.functionalize is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.func.functionalize instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('functionalize')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/eager_transforms.py", line 1582, in wrapped
    func_outputs = func(*func_args, **func_kwargs)
  File "<stdin>", line 5, in test
NotImplementedError: Could not run 'aten::_amp_foreach_non_finite_check_and_unscale_' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_amp_foreach_non_finite_check_and_unscale_' is only available for these backends: [XLA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

Anything obvious I'm missing in these changes? Thanks a lot!

@wonjoo-wj wonjoo-wj added the topic: not user facing label Feb 10, 2023
@wonjoo-wj wonjoo-wj self-assigned this Feb 10, 2023
@bdhirsh
Collaborator

bdhirsh commented Feb 16, 2023

@wonjoolee95 it looks like it's because nan_to_num's last 3 arguments are all defaultable (you need to include the defaults in your decomp; our tests probably try to call nan_to_num with just one argument and expect the defaults to get filled in).
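
To illustrate the point (a hedged sketch with illustrative names, not the PR's actual code), the decomposition's Python signature should declare and fill in the schema's defaults so that a bare t.nan_to_num() call still works:

import math
import torch

def nan_to_num_decomposition(self, nan=None, posinf=None, neginf=None):
    # Mirror the schema defaults: nan -> 0.0, posinf/neginf -> dtype extremes.
    # (Float-dtype path only, for brevity.)
    nan = 0.0 if nan is None else nan
    posinf = torch.finfo(self.dtype).max if posinf is None else posinf
    neginf = torch.finfo(self.dtype).min if neginf is None else neginf
    result = torch.where(torch.isnan(self), nan, self)
    result = torch.where(result == math.inf, posinf, result)
    return torch.where(result == -math.inf, neginf, result)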

@wonjoo-wj wonjoo-wj force-pushed the meta-tensor branch 2 times, most recently from 4e8fc43 to 3f30969 Compare February 17, 2023 22:21
@wonjoo-wj
Collaborator Author

The CIs are now looking a lot greener; the remaining tests are failing with a seemingly unrelated error:

Warning: Failed to download action 'https://api.github.com/repos/actions/upload-artifact/tarball/0b7f8abb1508181956e8e162db84b466c27e18ce'. Error: Response status code does not indicate success: 500 (Internal Server Error).
Warning: Back off 20.144 seconds before retry.
Error: Response status code does not indicate success: 500 (Internal Server Error).

I'll give it a retry.

However, I'm still seeing the same error as in #94633 (comment) even with this change. Looking into it more.

@wonjoo-wj
Collaborator Author

Synced with Brian offline, putting some information here. I was able to verify that the Meta registration shows up:

>>> print(torch._C._dispatch_dump("aten::nan_to_num.out"))
name: aten::nan_to_num.out
schema: aten::nan_to_num.out(Tensor self, float? nan=None, float? posinf=None, float? neginf=None, *, Tensor(a!) out) -> Tensor(a!)
debug: registered at /workspace/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
alias analysis kind: FROM_SCHEMA
Functionalize: registered at /workspace/pytorch/build/aten/src/ATen/RegisterFunctionalization_3.cpp:23953 :: (Tensor _0, float? _1, float? _2, float? _3, Tensor _4) -> Tensor _0 [ boxed unboxed ]
...
Meta: registered at /dev/null:219 :: (none) [ boxed ]
SparseCPU: registered at /workspace/pytorch/build/aten/src/ATen/RegisterSparseCPU.cpp:1379 :: (Tensor _0, float? _1, float? _2, float? _3, Tensor _4) -> Tensor _0 [ boxed unboxed ]
Autograd[alias]: registered at /workspace/pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:17909 :: (Tensor _0, float? _1, float? _2, float? _3, Tensor _4) -> Tensor _0 [ boxed unboxed ]

That said, the (none) in the Meta entry (Meta: registered at /dev/null:219 :: (none)) is a bit suspicious, as the other registrations don't have it.
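
One way to cross-check the suspicious entry is to ask the dispatcher directly whether a Meta kernel exists for the op (this uses an internal API, so treat it as a sketch):

import torch

# True iff a kernel is registered for the op at the Meta dispatch key.
print(torch._C._dispatch_has_kernel_for_dispatch_key(
    "aten::nan_to_num.out", torch._C.DispatchKey.Meta))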

@wonjoo-wj
Collaborator Author

Oddly enough, I can actually see that these ops work as intended in a Python interpreter:

import torch
import torch_xla  # needed so the 'xla' device is registered

x = torch.tensor([float('nan'), float('inf'), -float('inf'), 3.14])
x.nan_to_num_(1.0, 2.0, 3.0)  # in place: nan -> 1.0, inf -> 2.0, -inf -> 3.0
x_xla = torch.tensor([float('nan'), float('inf'), -float('inf'), 3.14], device='xla:0')
x_xla.nan_to_num_(1.0, 2.0, 3.0)
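
Since the failing path goes through the Meta backend rather than CPU or XLA, a closer reproduction (again a hedged sketch) would run the op on a meta tensor directly:

import torch

# Meta tensors carry only shape/dtype/device, no data; this exercises the
# same dispatch path the functionalization test hits.
x_meta = torch.empty(4, device='meta')
x_meta.nan_to_num_(1.0, 2.0, 3.0)  # raises NotImplementedError without a Meta kernel
print(x_meta.shape, x_meta.device)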

@wonjoo-wj
Collaborator Author

Closed with pytorch/xla#4687.

@wonjoo-wj wonjoo-wj closed this Apr 24, 2023
@github-actions github-actions Bot deleted the meta-tensor branch August 20, 2024 01:57