
[ARM] Fix infinite recursion in unwind #134387

Closed
Aidyn-A wants to merge 3 commits into pytorch:main from Aidyn-A:fix_unwind_aarch64

Conversation

@Aidyn-A
Collaborator

@Aidyn-A Aidyn-A commented Aug 24, 2024

Fixes #119905

Setting TORCH_SHOW_CPP_STACKTRACES=1 on ARM causes an infinitely recursive unwind: on failure, a StackTraceFetcher attempts to unwind the stack at the failed instruction:

r->cpp_frames_ = unwind::unwind();

then the unwind itself fails:
TORCH_CHECK(
false,
"record_context_cpp is not support on non-linux non-x86_64 platforms");

and it causes another attempt to unwind the failure in unwind()...

In summary, the code that ends up executing is equivalent to:

std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}

This PR replaces TORCH_CHECK with TORCH_WARN_ONCE, which does not cause uncontrolled recursion. The only side effect is an empty backtrace.
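The recursion and the fix can be sketched in isolation. The example below is a minimal simulation, not PyTorch code: `TracedError`, `unwind_throwing`, `unwind_warning`, and `kMaxDepth` are hypothetical names, and the depth cap exists only so that the demonstration terminates instead of overflowing the stack. It assumes that constructing the error object is what triggers the stack trace fetch.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are PyTorch APIs.
constexpr int kMaxDepth = 8;  // artificial cap so the demo terminates

std::vector<void*> unwind_throwing(int depth, int& calls);

// Models an error type that fetches a stack trace while it is being constructed.
struct TracedError : std::runtime_error {
  TracedError(const std::string& msg, int depth, int& calls)
      : std::runtime_error(msg) {
    if (depth < kMaxDepth) {
      // Fetching the trace re-enters the unwinder that is already failing.
      frames = unwind_throwing(depth + 1, calls);
    }
  }
  std::vector<void*> frames;
};

// Models an unsupported-platform unwinder guarded by a throwing check.
std::vector<void*> unwind_throwing(int depth, int& calls) {
  ++calls;
  throw TracedError("unwinding is not supported here", depth, calls);
}

// Models the fix: warn once and return an empty trace instead of throwing.
std::vector<void*> unwind_warning(int& calls) {
  ++calls;
  static bool warned = false;
  if (!warned) {
    warned = true;
    std::cerr << "warning: unwinding is not supported here\n";
  }
  return {};
}
```

With the throwing variant, a single top-level call re-enters the unwinder until the artificial cap is hit; without the cap it would never terminate normally. The warning variant is entered exactly once per call and yields an empty vector, which corresponds to the empty backtrace mentioned above.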

Huge thanks to @nWEIdia who found the root cause!

cc @malfet @snadampal @milpuz01

@pytorch-bot

pytorch-bot bot commented Aug 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134387

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1de64b9 with merge base 2553278 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@nWEIdia nWEIdia left a comment

Ingenious, speed-of-light forward fix!

@Aidyn-A Aidyn-A added the ciflow/trunk and module: arm labels Aug 24, 2024
@malfet
Contributor

malfet commented Aug 24, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@nWEIdia
Collaborator

nWEIdia commented Aug 24, 2024

Perhaps as a follow-up, we can remove IS_MACOS and the inductor check line?

@Aidyn-A Aidyn-A requested a review from a team as a code owner August 26, 2024 16:32
@albanD albanD removed their request for review August 26, 2024 16:55
@Aidyn-A
Collaborator Author

Aidyn-A commented Aug 26, 2024

All tests are green.

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Fixes pytorch#119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: https://github.com/pytorch/pytorch/blob/5ad759ca33ba8299cf7e1a6bb1dff7c9a5555e29/torch/csrc/profiler/combined_traceback.cpp#L25
then the unwind itself fails:
https://github.com/pytorch/pytorch/blob/5ad759ca33ba8299cf7e1a6bb1dff7c9a5555e29/torch/csrc/profiler/unwind/unwind.cpp#L10-L12
and it causes another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:
```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```
This PR replaces `TORCH_CHECK` by `TORCH_WARN_ONCE` as it will not cause an uncontrolled recursion. The only side effect would be an empty back-trace.

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: pytorch#134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Mar 15, 2026
…nces between x86 vs aarch64 (#176085)

In the test:

```
python  test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```
it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps because of these issues:
  - #119905
  - #134387

In the current setup when getting a stack trace string:
- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`
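
As an illustration of accepting both formats, a check along the following lines could be used. This is only a sketch: `has_cpp_stacktrace` is a hypothetical helper, not part of PyTorch or of the test in question, and it relies solely on the marker strings listed above.

```cpp
#include <string>

// Hypothetical helper: returns true if an error message contains either of
// the two C++ stack trace formats described above.
bool has_cpp_stacktrace(const std::string& message) {
  // Format reported on x86.
  const bool captured_style =
      message.find("C++ CapturedTraceback:") != std::string::npos;
  // Format reported on aarch64: a header line plus numbered frames.
  const bool frame_style =
      message.find("Exception raised from") != std::string::npos &&
      message.find("frame #") != std::string::npos;
  return captured_style || frame_style;
}
```

A message that carries only the CUDA error text, with neither marker, is rejected by this check.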

An example of the full string from an aarch64 system:
```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown 
function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: #176085
Approved by: https://github.com/eqy

Labels

ciflow/inductor, ciflow/trunk, Merged, module: arm, open source, topic: not user facing

Development

Successfully merging this pull request may close these issues.

Inductor test segfault on macos (and arm64) when C++ stacktraces are enabled

6 participants