[ARM] Fix infinite recursion in unwind by Aidyn-A · Pull Request #134387 · pytorch/pytorch

Aidyn-A · 2024-08-24T04:36:41Z

Fixes #119905

On ARM, setting TORCH_SHOW_CPP_STACKTRACES=1 causes an infinite recursive unwind: when an error is raised, a StackTraceFetcher attempts to unwind the stack from the point of failure:

pytorch/torch/csrc/profiler/combined_traceback.cpp, line 25 in 5ad759c:

r->cpp_frames_ = unwind::unwind();

then the unwind itself fails:

pytorch/torch/csrc/profiler/unwind/unwind.cpp, lines 10 to 12 in 5ad759c:

    TORCH_CHECK(
        false,
        "record_context_cpp is not support on non-linux non-x86_64 platforms");

which triggers another attempt to unwind, this time for the failure inside unwind() itself, and so on.

In summary, the executed code is effectively equivalent to:

std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}

This PR replaces TORCH_CHECK with TORCH_WARN_ONCE, which emits a warning instead of raising an error and therefore does not trigger the uncontrolled recursion. The only side effect is an empty backtrace.
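The failure mode and the fix can be reproduced in isolation. The following is a minimal sketch and not PyTorch code: `TracedError`, `unwind_with_check`, `unwind_with_warn_once`, `count_reentries_before_fix`, and the `kDemoDepthLimit` guard are all hypothetical names made up for illustration. It assumes only what is described above: an error type that requests a backtrace when it is constructed, and an unwinder that reports "unsupported platform" either by raising that error or by warning once.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are PyTorch APIs.
constexpr int kDemoDepthLimit = 8;  // demo-only guard so the example terminates
static int g_check_calls = 0;       // entries into unwind_with_check()
static int g_warnings_printed = 0;  // warnings emitted by unwind_with_warn_once()

std::vector<void*> unwind_with_check();

// An error type that captures a backtrace while it is being constructed,
// mimicking what happens when C++ stack traces are enabled.
struct TracedError : std::runtime_error {
  std::vector<void*> frames;
  explicit TracedError(const char* msg)
      : std::runtime_error(msg), frames(unwind_with_check()) {}
};

// Before the fix: the unsupported-platform path raises an error, and building
// that error asks for a backtrace, which re-enters this function.
std::vector<void*> unwind_with_check() {
  if (++g_check_calls > kDemoDepthLimit) {
    return {};  // the code path described above had no such limit
  }
  throw TracedError("unwind is not supported on this platform");
}

// After the fix: warn once, return an empty trace, never raise.
std::vector<void*> unwind_with_warn_once() {
  static bool warned = false;
  if (!warned) {
    warned = true;
    ++g_warnings_printed;
    std::fprintf(stderr, "warning: unwind is not supported on this platform\n");
  }
  return {};
}

// Returns how often unwind_with_check() was entered for one backtrace request.
int count_reentries_before_fix() {
  g_check_calls = 0;
  bool escaped = false;
  try {
    unwind_with_check();
  } catch (const TracedError&) {
    escaped = true;  // reachable only because of the demo guard
  }
  assert(escaped);
  return g_check_calls;
}
```

With the guard in place, a single call to `count_reentries_before_fix()` enters the throwing variant `kDemoDepthLimit + 1` times; without the guard it would recurse until the stack is exhausted. `unwind_with_warn_once()` always returns an empty vector and prints its warning only on the first call, which corresponds to the empty backtrace mentioned above.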

Huge thanks to @nWEIdia who found the root cause!

cc @malfet @snadampal @milpuz01

pytorch-bot · 2024-08-24T04:36:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134387

✅ No Failures

As of commit 1de64b9 with merge base 2553278:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

nWEIdia

Geniously speed of light forward fix!

malfet · 2024-08-24T19:06:46Z

@pytorchbot merge

pytorchmergebot · 2024-08-24T19:08:31Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

nWEIdia · 2024-08-24T23:19:06Z

Perhaps as a follow up, we can remove IS_MACOS and the inductor check line?

test/run_test.py

Aidyn-A · 2024-08-26T20:55:19Z

All tests are green.

@pytorchbot merge

pytorchmergebot · 2024-08-26T20:57:08Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@nWEIdia

Fixes pytorch#119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>:

https://github.com/pytorch/pytorch/blob/5ad759ca33ba8299cf7e1a6bb1dff7c9a5555e29/torch/csrc/profiler/combined_traceback.cpp#L25

then the unwind itself fails:

https://github.com/pytorch/pytorch/blob/5ad759ca33ba8299cf7e1a6bb1dff7c9a5555e29/torch/csrc/profiler/unwind/unwind.cpp#L10-L12

and it causes another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:

```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```

This PR replaces `TORCH_CHECK` by `TORCH_WARN_ONCE` as it will not cause an uncontrolled recursion. The only side effect would be an empty back-trace.

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: pytorch#134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet

…nces between x86 vs aarch64 (#176085)

In the test:
```
python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```
an exception is raised when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, and it carries the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps due to these issues:
- #119905
- #134387

In the current setup, the stack trace string:
- on x86 contains `C++ CapturedTraceback:`
- on aarch64 contains `Exception raised from` + `frame #`

An example of the full string from an aarch64 system:
```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: #176085
Approved by: https://github.com/eqy
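The two trace formats listed in that commit message suggest a platform-agnostic check that accepts either one. The following is a rough sketch of such a check, not the change made in #176085 (which edits a Python test). The helper name `has_cpp_stack_trace` is hypothetical; only the three marker strings are taken from the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of PyTorch or its test suite.
// Returns true if `message` contains a C++ stack trace in either of the two
// formats quoted above.
bool has_cpp_stack_trace(const std::string& message) {
  // Format reported for x86.
  const bool captured_traceback =
      message.find("C++ CapturedTraceback:") != std::string::npos;
  // Format reported for aarch64: a header line plus numbered frames.
  const bool frame_listing =
      message.find("Exception raised from") != std::string::npos &&
      message.find("frame #") != std::string::npos;
  return captured_traceback || frame_listing;
}
```

Requiring both `Exception raised from` and `frame #` for the aarch64 format mirrors the `+` in the list above; a looser check could accept either marker on its own.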

fix infinite recursion in aarch64 unwind

98e27b1

Aidyn-A requested review from albanD and malfet August 24, 2024 04:36

Aidyn-A requested review from aaronenyeshi and sraikund16 as code owners August 24, 2024 04:36

eqy approved these changes Aug 24, 2024

pytorchbot added the open source label Aug 24, 2024

nWEIdia approved these changes Aug 24, 2024

Aidyn-A added the ciflow/trunk and module: arm labels Aug 24, 2024

malfet approved these changes Aug 24, 2024

pytorchmergebot added the merging label Aug 24, 2024

pytorchmergebot removed the merging label Aug 24, 2024

Aidyn-A added the topic: not user facing label Aug 24, 2024

Print full c++ stack traces during retries for Mac OS

d54e9f6

Aidyn-A requested a review from a team as a code owner August 26, 2024 16:32

Aidyn-A added the ciflow/inductor label Aug 26, 2024

nWEIdia reviewed Aug 26, 2024

test/run_test.py (outdated, resolved)

albanD removed their request for review August 26, 2024 16:55

Print full c++ stack traces during retries on Mac for sure

1de64b9

nWEIdia reviewed Aug 26, 2024

test/run_test.py (resolved)

pytorchmergebot added the merging label Aug 26, 2024

pytorchmergebot added the Merged label Aug 26, 2024

pytorchmergebot closed this in 28a4db8 Aug 26, 2024

pytorchmergebot removed the merging label Aug 26, 2024

ppham-nv mentioned this pull request Mar 1, 2026

[test] fix test_libtorch_agnostic.py to hande c++ stack trace differences between x86 vs aarch64 #176085

Closed

Aidyn-A wants to merge 3 commits into pytorch:main from Aidyn-A:fix_unwind_aarch64

6 participants
