
Attn bias #1

Open
alugorey wants to merge 11 commits into update_ck_newest from attn_bias

Conversation

@alugorey
Owner

Fixes #ISSUE_NUMBER

pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2025
…pytorch#144120) (pytorch#146372)

Summary:

# Summary

### Sticky points

CUDA-graph RNG handling has changed and now deviates from the original implementation. Because of BC, we will be left with a dangling 'offset' value and confusing naming.

## Dependencies
- Flash PR: Dao-AILab/flash-attention#1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

After running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: pytorch#146372
Approved by: https://github.com/jbschlosser
alugorey pushed a commit that referenced this pull request Jun 2, 2025
Adds `torch.AcceleratorError`, which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`.

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's  [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32)
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
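
The steps above can be mirrored in pure Python as a rough analogy. This is only a sketch: `AcceleratorErrorSketch` and `new_accelerator_error` are hypothetical names introduced for illustration (they are not the real `torch.AcceleratorError` or the C++ helper), and the message and code passed in are example values.

```python
class AcceleratorErrorSketch(RuntimeError):
    """Hypothetical stand-in for an accelerator error type; not the real torch class."""


def new_accelerator_error(message: str, error_code: int) -> AcceleratorErrorSketch:
    # Construct the exception by calling its type with a single string
    # argument, loosely analogous to PyObject_CallOneArg on the C side.
    exc = AcceleratorErrorSketch(message)
    # Attach the numeric code as a plain attribute, loosely analogous to
    # PyObject_SetAttrString(exc, "error_code", ...).
    exc.error_code = error_code
    return exc


try:
    raise new_accelerator_error("device-side assert triggered", 710)
except RuntimeError as err:  # still catchable as a RuntimeError
    caught = err
    print("Exception was raised:", err.args[0])
    print("Captured error code is", err.error_code)
```

Because the sketch subclasses `RuntimeError`, existing `except RuntimeError` handlers keep working while new code can read `error_code`.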

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is  710
```
Pull Request resolved: pytorch#152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: pytorch#154436
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```
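
The arithmetic behind the sanitizer report can be checked with Python's arbitrary-precision integers. This is a small sketch that only reproduces the two operands from the message above; the variable names are illustrative, and it says nothing about how the kernel itself was changed.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

# Operands copied from the sanitizer message above.
lhs = 9223365439786057728
rhs = 13194139533312

total = lhs + rhs  # Python integers never overflow, so this is the exact sum

overflows_int64 = total > INT64_MAX
fits_uint64 = 0 <= total <= UINT64_MAX

print("exact sum:", total)
print("exceeds int64 range:", overflows_int64)
print("fits in uint64 range:", fits_uint64)
```

In C++, signed overflow is undefined behavior, while unsigned arithmetic is defined to wrap modulo 2^64, which is why an unsigned index type silences the sanitizer.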

Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might still happen while resolving the named dimension to a position by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below:
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```
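
The failure mode is indexing `tensors[0]` before confirming that the list is non-empty. A minimal sketch of the guard, assuming a toy "tensor" modeled as a tuple of dimension names and a shape; `cat_dim_position` and this `dimname_to_position` are hypothetical Python stand-ins, not the ATen functions.

```python
from typing import Sequence, Tuple

# A "tensor" here is just (dimension names, shape); purely illustrative.
NamedShape = Tuple[Tuple[str, ...], Tuple[int, ...]]


def dimname_to_position(tensor: NamedShape, dim: str) -> int:
    names, _shape = tensor
    if dim not in names:
        raise ValueError(f"Name {dim!r} not found in {names}.")
    return names.index(dim)


def cat_dim_position(tensors: Sequence[NamedShape], dim: str) -> int:
    # Guard first: indexing tensors[0] is only valid for a non-empty list.
    if len(tensors) == 0:
        raise ValueError("expected a non-empty list of tensors")
    return dimname_to_position(tensors[0], dim)


position = cat_dim_position([(("N", "C"), (2, 3))], "C")

try:
    cat_dim_position([], "C")
    empty_list_error = None
except ValueError as err:
    empty_list_error = str(err)

print(position, empty_list_error)
```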

TODOs:
 - Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
pytorchmergebot pushed a commit that referenced this pull request Jun 30, 2025
…torch#156600)

Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
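
The guard described above can be sketched in plain Python, with `None` standing in for a default-constructed (undefined) tensor. The names `column_sum`, `finalize_grad`, and `dgamma` are hypothetical and chosen only for illustration; this is not the kernel code.

```python
from typing import List, Optional

Matrix = List[List[float]]


def column_sum(rows: Matrix) -> List[float]:
    # Equivalent in spirit to sum(0): reduce over the first dimension.
    return [sum(col) for col in zip(*rows)]


def finalize_grad(partial: Optional[Matrix], out: Optional[List[float]]) -> bool:
    # None plays the role of a default-constructed (undefined) tensor.
    if partial is None or out is None:
        return False  # nothing to reduce, nothing to assign
    out[:] = column_sum(partial)
    return True


dgamma = [0.0, 0.0]
did_gamma = finalize_grad([[1.0, 2.0], [3.0, 4.0]], dgamma)
did_beta = finalize_grad(None, None)  # gradient not requested: skipped safely

print(dgamma, did_gamma, did_beta)
```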
Pull Request resolved: pytorch#156600
Approved by: https://github.com/eqy, https://github.com/ngimel
pytorchmergebot pushed a commit that referenced this pull request Jul 22, 2025
For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add a regression test that creates a Tensor in an array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
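
The numbers in the error message can be reproduced with the usual bound for a strided view. The helper below is a sketch under that assumption; `required_storage_nbytes` is a hypothetical name, not the ATen function. The final lines are a guess at how the buggy size of 15 could have arisen and are not confirmed by the source.

```python
from typing import Sequence


def required_storage_nbytes(
    sizes: Sequence[int], strides: Sequence[int], storage_offset: int, itemsize: int
) -> int:
    if any(size == 0 for size in sizes):
        return 0  # an empty tensor touches no storage
    # Index (in elements) one past the furthest element the view can reach.
    span = 1 + sum((size - 1) * stride for size, stride in zip(sizes, strides))
    # The offset is counted in elements, so it is scaled by itemsize too.
    return (storage_offset + span) * itemsize


# Values taken from the error message above.
needed = required_storage_nbytes([3, 3], [0, 1], storage_offset=3, itemsize=4)

# Guess (not confirmed by the source): 3 remaining elements of 4 bytes each,
# plus an offset of 3 that was NOT scaled by the element size.
unscaled_guess = 3 * 4 + 3
scaled = (3 + 3) * 4

print(needed, unscaled_guess, scaled)
```

With the offset scaled, the storage size and the required size agree, which is consistent with the one-line description of the fix.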
Pull Request resolved: pytorch#158690
Approved by: https://github.com/angelayi
alugorey pushed a commit that referenced this pull request Dec 17, 2025
This is the necessary fix for meta-pytorch/autoparallel#256.

### Issue:
When we call `_clear_fast_path_sharding_prop_cache()` and then `get_thread_local_native_sharding_propagator_cache()`, the code hangs due to a deadlock.

### Cause:
When you assign to a Python dict key that already exists:
```C++
thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule  // capsule #1 stored
...
clear_DTensor_sharding_propagator_cache() // call to clean up the cache
...
get_thread_local_native_sharding_propagator_cache() {
  std::lock_guard<std::mutex> lock(
        native_sharding_propagator_cache_cleanup_mutex);  // FIRST claims the lock!
  if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) { // enter this again because we have cleared the cache.
    ...
    // Destroys old_capsule FIRST then stores new_capsule. However, where we destroy the old_capsule,
    // it will trigger the destructor to claim `native_sharding_propagator_cache_cleanup_mutex` again!
    thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule  // SECOND claims the lock before FIRST releases
  }
}
```
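
The same pattern can be reproduced in pure Python. The sketch below relies on CPython's reference counting to run `__del__` as soon as the dict entry is overwritten, and it probes the lock with a non-blocking acquire so the demo records the would-be deadlock instead of hanging. All names (`Capsule`, `cleanup_mutex`, `replace_then_release`) are hypothetical, and the second function shows one possible way to avoid the problem, not necessarily the fix applied in the PR.

```python
import threading

cleanup_mutex = threading.Lock()  # plain, non-reentrant lock
events = []  # records (capsule name, whether the destructor got the lock)


class Capsule:
    def __init__(self, name: str) -> None:
        self.name = name

    def __del__(self) -> None:
        # The destructor wants the same mutex. A blocking acquire would hang
        # forever if the current thread already holds it, so probe instead.
        acquired = cleanup_mutex.acquire(blocking=False)
        events.append((self.name, acquired))
        if acquired:
            cleanup_mutex.release()


thread_dict = {"cleanup": Capsule("first")}


def replace_while_locked(name: str) -> None:
    with cleanup_mutex:
        # Overwriting the key drops the last reference to the old capsule,
        # so its destructor runs here, while the lock is still held.
        thread_dict["cleanup"] = Capsule(name)


def replace_then_release(name: str) -> None:
    with cleanup_mutex:
        old = thread_dict.get("cleanup")  # keep the old capsule alive
        thread_dict["cleanup"] = Capsule(name)
    del old  # destructor runs only after the lock has been released


replace_while_locked("second")
replace_then_release("third")
print(events)
```

The first replacement records a failed acquire (the point where a blocking lock would deadlock); the second defers destruction of the old value until the lock is free.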

Pull Request resolved: pytorch#168069
Approved by: https://github.com/ezyang
alugorey pushed a commit that referenced this pull request Dec 17, 2025
…orch#169475)

pytorch#168155 was needed to fix a Windows CI failure in torchaudio that looked like this:

<details>
<summary><b>click for example of torchaudio windows CI error</b></summary>
<br>

```
2025-11-15T21:11:03.9005985Z   C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(244): error: more than one instance of overloaded function "torch::stable::detail::from" matches the argument list:
2025-11-15T21:11:03.9007831Z               function template "StableIValue from(T)" (declared at line 593)
2025-11-15T21:11:03.9008639Z               function template "StableIValue torch::stable::detail::from(T)" (declared at line 528)
2025-11-15T21:11:03.9009336Z               argument types are: (StableListHandle)
2025-11-15T21:11:03.9009839Z           return from(new_list_handle);
2025-11-15T21:11:03.9010244Z                  ^
2025-11-15T21:11:03.9011886Z   C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(541): note #3326-D: function "torch::stable::detail::from(const torch::stable::Tensor &)" does not match because argument #1 does not match parameter
2025-11-15T21:11:03.9013826Z     [[maybe_unused]] inline StableIValue from(const torch::stable::Tensor& val) {
2025-11-15T21:11:03.9014403Z                                          ^
2025-11-15T21:11:03.9016129Z   C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(534): note #3327-D: candidate function template "torch::stable::detail::from(const std::optional<T> &)" failed deduction
2025-11-15T21:11:03.9017869Z     inline StableIValue from(const std::optional<T>& val) {
2025-11-15T21:11:03.9018335Z                         ^
2025-11-15T21:11:03.9019885Z   C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(609): note #3326-D: function "from(const torch::stable::Tensor &)" does not match because argument #1 does not match parameter
2025-11-15T21:11:03.9021652Z     from(const torch::stable::Tensor& val) {
2025-11-15T21:11:03.9022058Z     ^
2025-11-15T21:11:03.9023430Z   C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(601): note #3327-D: candidate function template "from(const std::optional<T> &)" failed deduction
2025-11-15T21:11:03.9025327Z     inline StableIValue from(const std::optional<T>& val) {
2025-11-15T21:11:03.9025793Z                         ^
2025-11-15T21:11:03.9026102Z             detected during:
2025-11-15T21:11:03.9027321Z               instantiation of "StableIValue torch::stable::detail::FromImpl<c10::HeaderOnlyArrayRef<T>>::call(const c10::HeaderOnlyArrayRef<T> &, uint64_t, __nv_bool) [with T=int64_t]" at line 529
2025-11-15T21:11:03.9029527Z               instantiation of "StableIValue torch::stable::detail::from(T) [with T=torch::headeronly::IntHeaderOnlyArrayRef]" at line 319 of C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/ops.h
2025-11-15T21:11:03.9030992Z
2025-11-15T21:11:03.9031753Z   1 error detected in the compilation of "C:/actions-runner/_work/audio/audio/pytorch/audio/src/libtorchaudio/forced_align/gpu/compute.cu"
```

</details>

But this broke BC in that after that PR `from(...)` is no longer usable without template arguments, which makes the code in fa3 https://github.com/Dao-AILab/flash-attention/blob/ad70a007e6287d4f7e766f94bcf2f9a813f20f6b/hopper/flash_api_stable.cpp#L1797-L1800 no longer compilable in 2.10

We could update the code in FA3, but that might require ifdefs for 2.9 vs 2.10 -- as a general principle for stable extensions, I'm not sure whether updating the extension code or not breaking BC of the headers is what we should go with here. But I'm leaning towards the latter.

This PR takes the alternative approach of restoring torchaudio Windows CI sanity by replacing all `{from/to}` in torch/csrc/stable/stableivalue_conversions.h with `torch::stable::detail::{from/to}` rather than making the `from`/`to` in the global namespace a function pointer

Confirmed that audio CI passes pytorch/audio#4133

Pull Request resolved: pytorch#169475
Approved by: https://github.com/albanD
alugorey pushed a commit that referenced this pull request Dec 17, 2025
…torch#168129)

(This PR will be rebased on pytorch#166174.) (Another PR updates the NCCL version: pytorch#168091.)

We did the following:
1. Added exchange of the buffer pointer and signal pad pointer via the NCCL device API introduced in NCCL 2.28.
2. With item 1 in place, showed that symmem from the NCCL backend works with the existing one_shot_all_reduce kernel (added a UT for it).
3. Added a simple put, put with signal, wait for signal, and get, so that symmem's one-sided API works.
4. Showed in a UT that symmem from the NCCL backend also works with traditional c10d collectives.
5. Stored DevComm inside symmetric memory so that users can access it for customized kernels.

Resolves pytorch#167682
Pull Request resolved: pytorch#168129
Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman
alugorey pushed a commit that referenced this pull request Feb 4, 2026
If another static object (like `g_device_config_parse_hook_registry_instance` created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (operator=) can fail, which leads to a runtime error.

When compiling with `-O1`, I see this issue:
```
src/libcxx/include/__functional/function.h:496:14: runtime error: member access within null pointer of type 'const __policy'
    #0 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:496:14
    #1 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:483:19
    #2 0x563224e28b78 in operator= crosstool/v18/stable/src/libcxx/include/__functional/function.h:727:8
    #3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) torch/c10/core/AllocatorConfig.h:263:32
    #4 0x563224e28e9d in DeviceConfigParserHookRegistry torch/c10/core/AllocatorConfig.h:369:5
    #5 0x563224e28e9d in __cxx_global_var_init.34 torch/c10/cuda/CUDAAllocatorConfig.cpp:195:1
    #6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp
    #7 0x5632459709ac in __libc_csu_init /usr/grte/v5/debug-src/src/csu/elf-init.c:88:7
    #8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d)
    #9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120
```
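
The failure described above is an instance of the static initialization order problem: a registrar object in one translation unit assigns to a static `std::function` that may not have been constructed yet. The commit message does not show the actual patch, so the following is only a minimal sketch of one common remedy, a function-local static that is constructed on first use. All names here (`ExampleConfig`, `ExampleRegistrar`, `storage`) are hypothetical and are not the PyTorch classes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

using Hook = std::function<void(const std::string&)>;

// Hypothetical stand-in for a config class that stores a parser hook.
class ExampleConfig {
 public:
  static void registerHook(Hook hook) {
    assert(hook && "hook must be callable");
    // storage() constructs the std::function on first call, so the
    // assignment never touches an uninitialized object, no matter which
    // static object in which translation unit calls us first.
    storage() = std::move(hook);
  }

  static void parse(const std::string& text) {
    if (storage()) {
      storage()(text);
    }
  }

 private:
  static Hook& storage() {
    static Hook hook;  // initialized on first use (thread-safe since C++11)
    return hook;
  }
};

// Hypothetical registrar, analogous in spirit to an object created by a
// registration macro: its constructor runs during static initialization.
struct ExampleRegistrar {
  ExampleRegistrar() {
    ExampleConfig::registerHook(
        [](const std::string& text) { lastParsed() = text; });
  }

  static std::string& lastParsed() {
    static std::string value;  // same construct-on-first-use idiom
    return value;
  }
};

static ExampleRegistrar g_example_registrar;
```

Calling `ExampleConfig::parse("some_option:1")` from `main` then invokes the hook registered during static initialization. The design choice is to trade a plain static data member for an accessor function, which costs one guarded check per access but removes the dependency on initialization order.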
Pull Request resolved: pytorch#172581
Approved by: https://github.com/guangyey, https://github.com/albanD
alugorey pushed a commit that referenced this pull request Feb 9, 2026
…ytorch#174247)

Summary:
This patch fixes the loss of signal info in core dumps produced by caffe2 apps when they crash.

The culprit is the signal handler's call to `raise` after unregistering itself. Under the hood, `raise` calls `tgkill`, which replaces whatever data was in the `siginfo_t` with the uid and pid of the calling process. This means that when the signal is re-raised and the process core dumps, the reason for the core dump is something like `SEGV sent by=your pid, your user`, without the address info or the `si_code` from the original signal. We fix this by re-raising the signal directly with the original signal information.

This is a port of yfeldblum's change in [Folly Signal Handler](facebook/folly@79d7f8e) to caffe2.
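
To illustrate the idea of re-raising with the original `siginfo_t`, here is a minimal Linux-only sketch. It assumes the re-raise is done with the `rt_tgsigqueueinfo` system call, which the commit message does not state explicitly; the helper names `requeueWithInfo` and `fatalHandler` are hypothetical and are not the caffe2 code.

```cpp
#include <cassert>
#include <csignal>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: queue `signum` to the calling thread and pass the
// caller-supplied siginfo_t through unchanged, instead of letting
// raise()/tgkill() overwrite it with the sender's pid and uid.
// Returns the raw syscall result (0 on success, -1 with errno on error).
long requeueWithInfo(int signum, siginfo_t* info) {
  assert(info != nullptr);
  pid_t pid = getpid();
  pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
  return syscall(SYS_rt_tgsigqueueinfo, pid, tid, signum, info);
}

// Sketch of a fatal-signal handler installed with SA_SIGINFO: restore the
// default action, then re-queue the signal with the original info. The
// signal is blocked while the handler runs (unless SA_NODEFER is set), so
// the queued copy is delivered when the handler returns and the default
// action then produces the core dump.
void fatalHandler(int signum, siginfo_t* info, void* /*ucontext*/) {
  struct sigaction dfl {};
  dfl.sa_handler = SIG_DFL;
  sigemptyset(&dfl.sa_mask);
  sigaction(signum, &dfl, nullptr);
  requeueWithInfo(signum, info);
}
```

A handler like this would be registered via `sigaction` with `sa_flags = SA_SIGINFO` and `sa_sigaction = fatalHandler`. The kernel only accepts a non-negative `si_code` from `rt_tgsigqueueinfo` when a process signals itself, which is the case here.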

Test Plan:
The diff above this one creates a small app that loads the caffe2 app and then segfaults. Inspecting the core locally:

```
(lldb) thread siginfo
thread #1: tid = 1711969, 0x000000000024f76a, name = 'signal_handler_', stop reason = SIGSEGV: address not mapped to object (fault address=0x1000)

(__lldb_siginfo_t) __lldb_siginfo = {
  si_signo = 11
  si_errno = 0
  si_code = 1
  __pad0 = 0
  _sifields = {
    _kill = (si_pid = 4096, si_uid = 0)
    _timer = {
      si_tid = 4096
      si_overrun = 0
      si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000)
    }
    _rt = {
      si_pid = 4096
      si_uid = 0
      si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000)
    }
    _sigchld = (si_pid = 4096, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0)
    _sigfault = {
      si_addr = 0x0000000000001000
      si_addr_lsb = 0
      _bounds = {
        _addr_bnd = (_lower = 0x0000000000000000, _upper = 0x0000000000000000)
        _pkey = 0
      }
    }
    _sigpoll = (si_band = 4096, si_fd = 0)
    _sigsys = (_call_addr = 0x0000000000001000, _syscall = 0, _arch = 0)
  }
}
```

And we see the siginfo contains the address which triggered the original SEGV.

Differential Revision: D92093984

Pull Request resolved: pytorch#174247
Approved by: https://github.com/Skylion007
alugorey pushed a commit that referenced this pull request Mar 9, 2026
…c8 kernel (pytorch#174362)

This will allow `sm_103` devices to call vec8 kernels.
Verification script:
```Python
import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda")

for dtype in (torch.bfloat16, torch.float16,):
    x = torch.randn(1024, device=device, dtype=dtype)
    with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
        y = torch.relu(x)
    stats = prof.key_averages()
    for entry in stats:
        if "at::native::vectorized_elementwise_kernel" in entry.key:
            print(entry.key)
```

Before:
```
void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul>)
void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul>)
```

After:
```
void at::native::vectorized_elementwise_kernel<8, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul>)
void at::native::vectorized_elementwise_kernel<8, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul>)
```

Pull Request resolved: pytorch#174362
Approved by: https://github.com/ngimel