Attn bias by alugorey · Pull Request #1 · alugorey/pytorch

alugorey · 2025-01-21T17:35:14Z

Fixes #ISSUE_NUMBER

…pytorch#144120) (pytorch#146372) Summary: # Summary ### Sticky points Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC ## Dependencies - Flash PR: Dao-AILab/flash-attention#1419 ### Other Points - The BC linter is complaining about losing generate.py and its functions which is not real BC surface cc albanD imported-using-ghimport Test Plan: Imported from OSS Building in dev `buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output ` I and Nming the .so I do see that the flash symbols are correctly named: ``` 0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()ROCm#7}::operator()() const 0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()ROCm#6}::operator()() const 0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()ROCm#7}::operator()() const 0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()ROCm#6}::operator()() const 0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()ROCm#7}::operator()() const ``` Reviewed By: vkuzo Differential Revision: D68502879 Pulled By: drisspg Pull Request resolved: pytorch#146372 Approved by: https://github.com/jbschlosser

Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError` `torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely - Convert cstr into Python string with `PyUnicode_FromString` - Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32) - Set `error_code` property using `PyObject_SetAttrString` - decref all temporary references Test that it works and captures CPP backtrace (in addition to CI) by running ```python import os os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1' import torch x = torch.rand(10, device="cuda") y = torch.arange(20, device="cuda") try: x[y] = 2 print(x) except torch.AcceleratorError as e: print("Exception was raised", e.args[0]) print("Captured error code is ", e.error_code) ``` which produces following output ``` Exception was raised CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first): C++ CapturedTraceback: ROCm#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 ROCm#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 ROCm#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0 ROCm#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0 ROCm#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0 ROCm#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0 ROCm#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0 ROCm#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0 ROCm#12 at::native::abs(at::Tensor const&) from ??:0 ROCm#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0 ROCm#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0 ROCm#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 ROCm#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 ROCm#17 at::_ops::abs::call(at::Tensor const&) from ??:0 ROCm#18 at::native::isfinite(at::Tensor const&) from ??:0 ROCm#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0 ROCm#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0 ROCm#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0 ROCm#22 PyObject_CallFunctionObjArgs from ??:0 ROCm#23 _PyObject_MakeTpCall from ??:0 ROCm#24 _PyEval_EvalFrameDefault from ??:0 ROCm#25 _PyObject_FastCallDictTstate from ??:0 ROCm#26 _PyStack_AsDict from ??:0 ROCm#27 _PyObject_MakeTpCall from ??:0 ROCm#28 _PyEval_EvalFrameDefault from ??:0 ROCm#29 _PyFunction_Vectorcall from ??:0 ROCm#30 _PyEval_EvalFrameDefault from ??:0 ROCm#31 _PyFunction_Vectorcall from ??:0 ROCm#32 _PyEval_EvalFrameDefault from ??:0 ROCm#33 _PyFunction_Vectorcall from ??:0 ROCm#34 _PyEval_EvalFrameDefault from ??:0 ROCm#35 PyFrame_GetCode from ??:0 ROCm#36 PyNumber_Xor from ??:0 ROCm#37 PyObject_Str from ??:0 ROCm#38 PyFile_WriteObject from ??:0 ROCm#39 _PyWideStringList_AsList from ??:0 ROCm#40 _PyDict_NewPresized from ??:0 ROCm#41 _PyEval_EvalFrameDefault from ??:0 ROCm#42 PyEval_EvalCode from ??:0 ROCm#43 PyEval_EvalCode from ??:0 ROCm#44 PyUnicode_Tailmatch from ??:0 ROCm#45 PyInit__collections from ??:0 ROCm#46 PyUnicode_Tailmatch from ??:0 ROCm#47 _PyRun_SimpleFileObject from ??:0 ROCm#48 _PyRun_AnyFileObject from ??:0 ROCm#49 Py_RunMain from ??:0 ROCm#50 Py_BytesMain from ??:0 ROCm#51 __libc_init_first from ??:0 ROCm#52 __libc_start_main from ??:0 ROCm#53 _start from ??:0 Captured error code is 710 ``` Pull Request resolved: pytorch#152023 Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel ghstack dependencies: pytorch#154436

Use uint64_t index types to avoid ``` torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long' #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24 #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12 #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 ROCm#3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 ``` Pull Request resolved: pytorch#154809 Approved by: https://github.com/soulitzer

Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b Even though, check for empty tensor list exists in `at::cat` crash might happens while resolving named dimension to position, by calling `dimname_to_position(tensors[0], dim)`, see backtrace below ``` (lldb) up frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10 556 bool has_names() const { 557 // If a user is using unnamed tensors, then we can short-circuit right here. 558 // Otherwise, impl::has_names attempts to retrieve names. -> 559 if (!impl_->has_named_tensor_meta()) { 560 return false; 561 } 562 return impl::has_names(unsafeGetTensorImpl()); (lldb) up frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3 20 int64_t dimname_to_position(const Tensor& tensor, Dimname dim) { 21 TORCH_CHECK(dim.type() != NameType::WILDCARD, 22 "Please look up dimensions by name, got: name = None."); -> 23 TORCH_CHECK(tensor.has_names(), 24 "Name ", dim, " not found in ", toDimnameRepr(tensor), "."); 25 const auto names = tensor.names(); 26 ``` TODOs: - May be move test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable) - Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests Fixes pytorch#155306 Pull Request resolved: pytorch#155383 Approved by: https://github.com/cyyever, https://github.com/ezyang ghstack dependencies: pytorch#155382

…torch#156600) Don't call `sum()` on a tensor that is default constructed. Previously we could call `sum()` on a tensor that was default-contructed. That would lead to an error like this: ``` Traceback (most recent call last): File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper method(*args, **kwargs) File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps ln_out_cuda.backward(grad_output_cuda) File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward torch.autograd.backward( File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: tensor does not have a device Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first): C++ CapturedTraceback: ROCm#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 ROCm#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 ROCm#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0 ROCm#7 at::TensorBase::options() const from :0 ROCm#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0 ROCm#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 ROCm#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 ROCm#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 ROCm#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 ROCm#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0 ROCm#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 ROCm#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 ROCm#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0 ROCm#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0 ``` Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment. Pull Request resolved: pytorch#156600 Approved by: https://github.com/eqy, https://github.com/ngimel

For tensor with non-zero offset, it must be multiplied by element size Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with ``` C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15 Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame ROCm#3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame ROCm#4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame ROCm#5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame ROCm#6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame ROCm#7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame ROCm#8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame ROCm#9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame ROCm#10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame ROCm#11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame ROCm#12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame ROCm#13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic) ``` Pull Request resolved: pytorch#158690 Approved by: https://github.com/angelayi

This is the necessary fix for meta-pytorch/autoparallel#256. ### Issue: when we call `_clear_fast_path_sharding_prop_cache()`, and then `get_thread_local_native_sharding_propagator_cache()`, the code will stuck due to deadlock. ### Cause: When you assign to a Python dict key that already exists: ```C++ thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule // capsule #1 stored ... clear_DTensor_sharding_propagator_cache() // call to clean up the cache ... get_thread_local_native_sharding_propagator_cache() { std::lock_guard<std::mutex> lock( native_sharding_propagator_cache_cleanup_mutex); // FIRST claims the lock! if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) { // enter this again because we have cleared the cache. ... // Destroys old_capsule FIRST then stores new_capsule. However, where we destroy the old_capsule, // it will trigger the destructor to claim `native_sharding_propagator_cache_cleanup_mutex` again! thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule // SECOND claims the lock before FIRST releases } } ``` Pull Request resolved: pytorch#168069 Approved by: https://github.com/ezyang

…orch#169475) pytorch#168155 was needed to fix Windows CI in torchaudio that looked like such <details> <summary><b>click for example of torchaudio windows CI error</b></summary> <br> ``` 2025-11-15T21:11:03.9005985Z C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(244): error: more than one instance of overloaded function "torch::stable::detail::from" matches the argument list: 2025-11-15T21:11:03.9007831Z function template "StableIValue from(T)" (declared at line 593) 2025-11-15T21:11:03.9008639Z function template "StableIValue torch::stable::detail::from(T)" (declared at line 528) 2025-11-15T21:11:03.9009336Z argument types are: (StableListHandle) 2025-11-15T21:11:03.9009839Z return from(new_list_handle); 2025-11-15T21:11:03.9010244Z ^ 2025-11-15T21:11:03.9011886Z C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(541): note pytorch#3326-D: function "torch::stable::detail::from(const torch::stable::Tensor &)" does not match because argument #1 does not match parameter 2025-11-15T21:11:03.9013826Z [[maybe_unused]] inline StableIValue from(const torch::stable::Tensor& val) { 2025-11-15T21:11:03.9014403Z ^ 2025-11-15T21:11:03.9016129Z C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(534): note pytorch#3327-D: candidate function template "torch::stable::detail::from(const std::optional<T> &)" failed deduction 2025-11-15T21:11:03.9017869Z inline StableIValue from(const std::optional<T>& val) { 2025-11-15T21:11:03.9018335Z ^ 2025-11-15T21:11:03.9019885Z C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(609): note pytorch#3326-D: function "from(const torch::stable::Tensor &)" does not match because argument #1 does not match parameter 2025-11-15T21:11:03.9021652Z from(const torch::stable::Tensor& val) { 2025-11-15T21:11:03.9022058Z ^ 2025-11-15T21:11:03.9023430Z C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/stableivalue_conversions.h(601): note pytorch#3327-D: candidate function template "from(const std::optional<T> &)" failed deduction 2025-11-15T21:11:03.9025327Z inline StableIValue from(const std::optional<T>& val) { 2025-11-15T21:11:03.9025793Z ^ 2025-11-15T21:11:03.9026102Z detected during: 2025-11-15T21:11:03.9027321Z instantiation of "StableIValue torch::stable::detail::FromImpl<c10::HeaderOnlyArrayRef<T>>::call(const c10::HeaderOnlyArrayRef<T> &, uint64_t, __nv_bool) [with T=int64_t]" at line 529 2025-11-15T21:11:03.9029527Z instantiation of "StableIValue torch::stable::detail::from(T) [with T=torch::headeronly::IntHeaderOnlyArrayRef]" at line 319 of C:/actions-runner/_work/audio/audio/pytorch/audio/env/Lib/site-packages/torch/include\torch/csrc/stable/ops.h 2025-11-15T21:11:03.9030992Z 2025-11-15T21:11:03.9031753Z 1 error detected in the compilation of "C:/actions-runner/_work/audio/audio/pytorch/audio/src/libtorchaudio/forced_align/gpu/compute.cu" ``` </details> But this broke BC in that after that PR `from(...)` is no longer usable without template arguments, which makes the code in fa3 https://github.com/Dao-AILab/flash-attention/blob/ad70a007e6287d4f7e766f94bcf2f9a813f20f6b/hopper/flash_api_stable.cpp#L1797-L1800 no longer compilable in 2.10 We could update the code in FA3, but that might require ifdefs for 2.9 vs 2.10 -- as a general principle for stable extensions, I'm not sure whether updating the extension code or not breaking BC of the headers is what we should go with here. But I'm leaning towards the latter. This PR takes the alternative approach of restoring torchaudio Windows CI sanity by replacing all `{from/to}` in torch/csrc/stable/stableivalue_conversions.h with `torch::stable::detail::{from/to}` rather than making the `from`/`to` in the global namespace a function pointer Confirmed that audio CI passes pytorch/audio#4133 Pull Request resolved: pytorch#169475 Approved by: https://github.com/albanD

…torch#168129) (This PR will be rebased on pytorch#166174) (There are other PR which updates NCCL version: pytorch#168091) We did the following thing: 1. To add exchange of buffer ptr and signal pad ptr via NCCL device API introduced in nccl 2.28. 2. With #1, we showed that the symmem from nccl backend works with existing one_shot_all_reduce kernel (Add a UT for it) 3. Add a simple put, put with signal, wait for signal and get. So that symmem's one side API works. 4. Show that symmem from nccl backend works with traditional c10d collective as well in UT. 5. Stored DevComm inside symmetric memory so that users can access to it for customized kernel. Resolves pytorch#167682 Pull Request resolved: pytorch#168129 Approved by: https://github.com/kwen2501, https://github.com/ngimel, https://github.com/atalman

If another static object (like `g_device_config_parse_hook_registry_instance` created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (operator=) can fail, which leads to a runtime error. When I use a compilation optimization of ` -O1` I see this issue: ``` [src/libcxx/include/__functional/function.h:496]:14: runtime error: member access within null pointer of type 'const __policy' #0 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:496]:14 #1 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:483]:19 #2 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:727]:8 ROCm#3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) [torch/c10/core/AllocatorConfig.h:263]:32 ROCm#4 0x563224e28e9d in DeviceConfigParserHookRegistry [torch/c10/core/AllocatorConfig.h:369]:5 ROCm#5 0x563224e28e9d in __cxx_global_var_init.34 [torch/c10/cuda/CUDAAllocatorConfig.cpp:195]:1 ROCm#6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp ROCm#7 0x5632459709ac in __libc_csu_init /[usr/grte/v5/debug-src/src/csu/elf-init.c:88]:7 ROCm#8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d) ROCm#9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120 ``` Pull Request resolved: pytorch#172581 Approved by: https://github.com/guangyey, https://github.com/albanD

…ytorch#174247) Summary: This patch fixes the loss of signal info in Coredumps produced by caffe2 apps when they crash. The culprit is the signal handler's call to `raise` after unregistering itself. Raise under the hood actually calls `tgkill` which replaces whatever the data into the `siginfo_t` with the uid and pid of the calling process. This means when the signal and re-raised and the process coredumps, the reason for the coredump is something like `SEGV sent by=your pid, your user` without the address info or the SI_CODE from the original signal. We fix this by calling raise signal directly with the original signal. This is a port of yfeldblum's change in [Folly Signal Handler](facebook/folly@79d7f8e) to caffe2. Test Plan: In the diff above this one creates a small app that loads the caffe2 app and then SEGV's. Then inspecting the core locally ``` (lldb) thread siginfo thread #1: tid = 1711969, 0x000000000024f76a, name = 'signal_handler_', stop reason = SIGSEGV: address not mapped to object (fault address=0x1000) (__lldb_siginfo_t) __lldb_siginfo = { si_signo = 11 si_errno = 0 si_code = 1 __pad0 = 0 _sifields = { _kill = (si_pid = 4096, si_uid = 0) _timer = { si_tid = 4096 si_overrun = 0 si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000) } _rt = { si_pid = 4096 si_uid = 0 si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000) } _sigchld = (si_pid = 4096, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0) _sigfault = { si_addr = 0x0000000000001000 si_addr_lsb = 0 _bounds = { _addr_bnd = (_lower = 0x0000000000000000, _upper = 0x0000000000000000) _pkey = 0 } } _sigpoll = (si_band = 4096, si_fd = 0) _sigsys = (_call_addr = 0x0000000000001000, _syscall = 0, _arch = 0) } } ``` And we see the siginfo contains the address which triggered the original SEGV. Differential Revision: D92093984 Pull Request resolved: pytorch#174247 Approved by: https://github.com/Skylion007

…c8 kernel (pytorch#174362) This will allow `sm_103` devices call vec8 kernels. Verification script: ```Python import torch from torch.profiler import profile, ProfilerActivity device = torch.device("cuda") for dtype in (torch.bfloat16, torch.float16,): x = torch.randn(1024, device=device, dtype=dtype) with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof: y = torch.relu(x) stats = prof.key_averages() for entry in stats: if "at::native::vectorized_elementwise_kernel" in entry.key: print(entry.key) ``` Before: ``` void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul>) void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul>) ``` After: ``` void at::native::vectorized_elementwise_kernel<8, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#9}::operator()() const::{lambda(c10::BFloat16)#1}, std::array<char*, 2ul>) void at::native::vectorized_elementwise_kernel<8, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul> >(int, at::native::(anonymous namespace)::launch_clamp_scalar(at::TensorIteratorBase&, c10::Scalar, c10::Scalar, at::native::detail::ClampLimits)::{lambda()#1}::operator()() const::{lambda()ROCm#8}::operator()() const::{lambda(c10::Half)#1}, std::array<char*, 2ul>) ``` Pull Request resolved: pytorch#174362 Approved by: https://github.com/ngimel

alugorey added 6 commits January 20, 2025 21:17

Adding plumbing for attn_bias in CK sdpa

454334e

Shim stuff

6bd43c4

Generated c_shim_cuda.h changes for just _sdpfa

8bdf347

Add flash_attention attn_bias param

af6f988

add default to derivatives

211fe7e

Remove default nature of attn_bias

2b59ab6

alugorey force-pushed the attn_bias branch from 34767e6 to 2b59ab6 Compare January 23, 2025 22:40

alugorey added 5 commits January 24, 2025 16:33

functorch default arg

e8452fc

nested tensor change

ad6f3eb

don't call v3 from v2

f31a0a6

swizzle some args

25981de

nested stuff

1194958

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attn bias#1

Attn bias#1
alugorey wants to merge 11 commits intoupdate_ck_newestfrom
attn_bias

alugorey commented Jan 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alugorey commented Jan 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant