[ci][cpu] Update compiler to GCC-13 in jammy-aarch64#166849

Closed
fadara01 wants to merge 7 commits into gh/fadara01/7/base from gh/fadara01/7/head

@fadara01
Collaborator

@fadara01 fadara01 commented Nov 3, 2025

Stack from ghstack (oldest at bottom):

This is needed because manylinux has used GCC 13 since #152825.
Because of the resulting compiler version mismatch, we've seen tests pass in jammy-aarch64 pre-commit CI but fail for wheels built in manylinux.
Related to: #166736
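
The mismatch described above can be checked mechanically. A minimal sketch, assuming each environment's `gcc --version` output has been captured as text; the function names are hypothetical and the check is not part of the PyTorch CI scripts.

```python
import re


def gcc_major(version_output: str) -> int:
    """Extract the GCC major version from the first line of `gcc --version` output."""
    lines = version_output.splitlines()
    if not lines:
        raise ValueError("empty version output")
    # The first dotted triple on the first line is the compiler version.
    match = re.search(r"(\d+)\.\d+\.\d+", lines[0])
    if match is None:
        raise ValueError(f"no version number found in: {lines[0]!r}")
    return int(match.group(1))


def toolchains_match(ci_output: str, wheel_output: str) -> bool:
    """True when the pre-commit CI image and the wheel builder share a GCC major version."""
    return gcc_major(ci_output) == gcc_major(wheel_output)
```

Comparing only the major version is a deliberate simplification: the failures discussed in this thread come from differences between major GCC releases.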

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @seemethere @malfet @pytorch/pytorch-dev-infra @snadampal @milpuz01 @nikhil-arm

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166849

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit cf99307 with merge base fbd70fb (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fadara01 added a commit that referenced this pull request Nov 3, 2025
This is needed because manylinux uses GCC-13 since #152825
As a result of the current compiler version mismatches, we've seen tests
passing jammy-aarch64 pre-commit CI, but failing for wheels built in manylinux
Related to: #166736


ghstack-source-id: a68de47
Pull-Request: #166849
@pytorch-bot added the "topic: not user facing" label Nov 3, 2025
@fadara01 added the "module: cpu", "module: arm", and "ciflow/linux-aarch64" labels Nov 3, 2025
@robert-hardwick added the "module: ci" label Nov 3, 2025
Collaborator

@robert-hardwick robert-hardwick left a comment

Need to update GCC_VERSION variable

Comment thread .ci/docker/build.sh Outdated
Comment thread .ci/docker/build.sh Outdated
[ghstack-poisoned]
@fadara01 fadara01 requested review from a team and jeffdaily as code owners November 3, 2025 10:21
fadara01 added a commit that referenced this pull request Nov 3, 2025
This is needed because manylinux uses GCC-13 since #152825
As a result of the current compiler version mismatches, we've seen tests
passing jammy-aarch64 pre-commit CI, but failing for wheels built in manylinux
Related to: #166736

ghstack-source-id: 9d18d85
Pull-Request: #166849
Collaborator

@robert-hardwick robert-hardwick left a comment

LGTM

@fadara01
Collaborator Author

fadara01 commented Nov 3, 2025

Actually, we also need to update both manylinux and jammy to GCC 14, per @malfet's comment about manylinux standards on #166736:

> Also, per manylinux2_28 standard all builds should be done by gcc-14 toolchain (see https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based ) if this is not the case, than it's a bug, please don't hesitate to propose a PR that fixes it

Let's first address the GCC version mismatch for AArch64 between jammy and manylinux (and get jammy to build with GCC 13). Then we'll raise PRs on this stack updating both to GCC 14, which I think is related to #149828 and #152426.

@fadara01
Collaborator Author

fadara01 commented Nov 3, 2025

Oh, OpenBLAS is failing to link with GCC-13 in jammy due to missing -lgfortran

/usr/bin/ld: cannot find -lgfortran: No such file or directory
/usr/bin/ld: cannot find -lgfortran: No such file or directory
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:207: ../libopenblasp-r0.3.30.so] Error 1
make: *** [Makefile:149: shared] Error 2

#80 ERROR: process "/bin/sh -c if [ -n \"${OPENBLAS}\" ]; then bash ./install_openblas.sh; fi" did not complete successfully: exit code: 2
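
The linker error suggests the GCC 13 Fortran runtime is not installed in the image, which likely means the versioned gfortran driver is missing as well. A minimal sketch of a pre-flight check, assuming Ubuntu-style versioned binary names (`gcc-13`, `g++-13`, `gfortran-13`); the helper name and the tool list are hypothetical and not part of the PyTorch CI scripts.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, preserving input order."""
    # `which` is injectable so the check can be exercised without a real toolchain.
    return [tool for tool in tools if which(tool) is None]


# Assumed toolchain for a GCC 13 image; adjust the names to the image in use.
GCC13_TOOLCHAIN = ("gcc-13", "g++-13", "gfortran-13")
```

Running `missing_tools(GCC13_TOOLCHAIN)` before the OpenBLAS build would surface an absent Fortran compiler early, rather than at link time.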

@malfet
Contributor

malfet commented Nov 3, 2025

> Oh, OpenBLAS is failing to link with GCC-13 in jammy due to missing -lgfortran
>
> /usr/bin/ld: cannot find -lgfortran: No such file or directory
> /usr/bin/ld: cannot find -lgfortran: No such file or directory
> collect2: error: ld returned 1 exit status
> make[1]: *** [Makefile:207: ../libopenblasp-r0.3.30.so] Error 1
> make: *** [Makefile:149: shared] Error 2
>
> #80 ERROR: process "/bin/sh -c if [ -n \"${OPENBLAS}\" ]; then bash ./install_openblas.sh; fi" did not complete successfully: exit code: 2

@fadara01 just install gcc13-gfortran or something (I bet there is a line somewhere in the scripts that does it already)

@fadara01
Collaborator Author

fadara01 commented Nov 3, 2025

> gcc13-gfortran or something (I bet there is a line somewhere in the scripts that does it already)

Yup that's what I'm doing locally

@malfet
Contributor

malfet commented Nov 3, 2025

> > gcc13-gfortran or something (I bet there is a line somewhere in the scripts that does it already)
>
> Yup that's what I'm doing locally

Alternatively, you can just move CI to a more recent version of Ubuntu.

[ghstack-poisoned]
fadara01 added a commit that referenced this pull request Nov 3, 2025
This is needed because manylinux uses GCC-13 since #152825
As a result of the current compiler version mismatches, we've seen tests
passing jammy-aarch64 pre-commit CI, but failing for wheels built in manylinux
Related to: #166736

ghstack-source-id: 36be71d
Pull-Request: #166849
@fadara01
Collaborator Author

fadara01 commented Nov 3, 2025

The current linux-aarch64 / linux-jammy-aarch64-py3.10 / build failure is a known issue with GCC 13 (it also failed in nightly), from #166687:

In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp:6,
                 from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/Loops.h: In function ‘void at::native::SVE256::vectorized_loop(char**, int64_t, int64_t, func_t&&, vec_func_t&&) [with func_t = at::native::{anonymous}::smooth_l1_backward_cpu_kernel(at::TensorIterator&, const c10::Scalar&, double)::<lambda()>::<lambda()>::<lambda(scalar_t, scalar_t, scalar_t)>&; vec_func_t = at::native::{anonymous}::smooth_l1_backward_cpu_kernel(at::TensorIterator&, const c10::Scalar&, double)::<lambda()>::<lambda()>::<lambda(at::vec::SVE256::Vectorized<c10::Half>, at::vec::SVE256::Vectorized<c10::Half>, at::vec::SVE256::Vectorized<c10::Half>)>&]’:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/Loops.h:200:1: internal compiler error: in expand_insn, at optabs.cc:8185
  200 | vectorized_loop(char** C10_RESTRICT data_, int64_t n, int64_t S, func_t&& op, vec_func_t&& vop) {
      | ^~~~~~~~~~~~~~~

[ghstack-poisoned]
@fadara01
Collaborator Author

fadara01 commented Nov 3, 2025

After rebasing, we now have a new failure from a GCC 13 warning that's being treated as an error:

In file included from /usr/include/c++/13/bits/stl_uninitialized.h:63,
                 from /usr/include/c++/13/memory:69,
                 from /var/lib/jenkins/workspace/third_party/googletest/googletest/include/gtest/gtest.h:55,
                 from /var/lib/jenkins/workspace/test/cpp/api/inference_mode.cpp:1:
In static member function ‘static _Up* std::__copy_move<_IsMove, true, std::random_access_iterator_tag>::__copy_m(_Tp*, _Tp*, _Up*) [with _Tp = long unsigned int; _Up = long unsigned int; bool _IsMove = false]’,
    inlined from ‘_OI std::__copy_move_a2(_II, _II, _OI) [with bool _IsMove = false; _II = long unsigned int*; _OI = long unsigned int*]’ at /usr/include/c++/13/bits/stl_algobase.h:506:30,
    inlined from ‘_OI std::__copy_move_a1(_II, _II, _OI) [with bool _IsMove = false; _II = long unsigned int*; _OI = long unsigned int*]’ at /usr/include/c++/13/bits/stl_algobase.h:533:42,
    inlined from ‘_OI std::__copy_move_a(_II, _II, _OI) [with bool _IsMove = false; _II = long unsigned int*; _OI = long unsigned int*]’ at /usr/include/c++/13/bits/stl_algobase.h:540:31,
    inlined from ‘_OI std::copy(_II, _II, _OI) [with _II = long unsigned int*; _OI = long unsigned int*]’ at /usr/include/c++/13/bits/stl_algobase.h:633:7,
    inlined from ‘std::vector<bool, _Alloc>::iterator std::vector<bool, _Alloc>::_M_copy_aligned(const_iterator, const_iterator, iterator) [with _Alloc = std::allocator<bool>]’ at /usr/include/c++/13/bits/stl_bvector.h:1305:28,
    inlined from ‘void std::vector<bool, _Alloc>::_M_reallocate(size_type) [with _Alloc = std::allocator<bool>]’ at /usr/include/c++/13/bits/vector.tcc:851:40,
    inlined from ‘void std::vector<bool, _Alloc>::reserve(size_type) [with _Alloc = std::allocator<bool>]’ at /usr/include/c++/13/bits/stl_bvector.h:1093:17,
    inlined from ‘static std::enable_if_t<is_same_v<X, T>, decltype (X::forward(nullptr, (declval<Args>)()...))> torch::autograd::Function<T>::apply(Args&& ...) [with X = InferenceModeTest_TestCustomFunction_Test::TestBody()::MyFunction; Args = {at::Tensor&, int&, at::Tensor&}; T = InferenceModeTest_TestCustomFunction_Test::TestBody()::MyFunction]’ at /var/lib/jenkins/workspace/torch/csrc/autograd/custom_function.h:483:35,
    inlined from ‘virtual void InferenceModeTest_TestCustomFunction_Test::TestBody()’ at /var/lib/jenkins/workspace/test/cpp/api/inference_mode.cpp:644:47:
/usr/include/c++/13/bits/stl_algobase.h:437:30: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ forming offset 8 is out of the bounds [0, 8] [-Werror=array-bounds=]
  437 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
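
When a build fails with "all warnings being treated as errors", the bracketed tag at the end of the diagnostic names the warning to act on. A minimal sketch that pulls those tags out of a captured build log; the function names are hypothetical, and the thread does not show which mechanism the PR ultimately used, so the `-Wno-error=` suggestion is only one option.

```python
import re
from typing import List

# GCC tags a warning promoted to an error as "[-Werror=<name>]"; warnings that
# accept a level, such as array-bounds, carry a trailing "=" in the tag.
WERROR_TAG = re.compile(r"\[-Werror=([\w-]+)=?\]")


def promoted_warnings(build_log: str) -> List[str]:
    """Return the distinct warning names that failed the build, in first-seen order."""
    seen: List[str] = []
    for match in WERROR_TAG.finditer(build_log):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def downgrade_flags(build_log: str) -> List[str]:
    """Suggest -Wno-error=<name> flags that turn each error back into a warning."""
    return [f"-Wno-error={name}" for name in promoted_warnings(build_log)]
```

Downgrading with `-Wno-error=<name>` keeps the diagnostic visible in the log, whereas `-Wno-<name>` would hide it entirely; which is preferable depends on whether the warning is a known false positive.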

@fadara01
Collaborator Author

fadara01 commented Nov 5, 2025

I think this is the corresponding GCC issue for the new warning we're seeing: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113239

Let's silence it.

[ghstack-poisoned]
@fadara01
Collaborator Author

fadara01 commented Nov 5, 2025

@pytorchbot merge -f "AArch64 operator benchmarks are very flakey"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 1808a854a04d133fb81205b951d25a33b078c903 returned non-zero exit code 1

Auto-merging .ci/docker/build.sh
CONFLICT (content): Merge conflict in .ci/docker/build.sh
Auto-merging .github/workflows/docker-builds.yml
CONFLICT (content): Merge conflict in .github/workflows/docker-builds.yml
error: could not apply 1808a854a04... [ci][cpu] Update compiler to GCC-13 in jammy-aarch64
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job
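
For triage, the conflicting files can be read straight out of the cherry-pick output. A minimal sketch, assuming the output has been captured as text; the function name is hypothetical.

```python
import re
from typing import List

# Matches lines such as "CONFLICT (content): Merge conflict in <path>".
CONFLICT_LINE = re.compile(r"^CONFLICT \(([^)]+)\): Merge conflict in (.+)$", re.MULTILINE)


def conflicted_paths(git_output: str) -> List[str]:
    """Return the paths git reported as conflicting, in the order they appear."""
    return [match.group(2).strip() for match in CONFLICT_LINE.finditer(git_output)]
```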

[ghstack-poisoned]
@fadara01
Collaborator Author

fadara01 commented Nov 5, 2025

Darn, after rebasing, we now fail with another GCC 13 warning, -Werror=dangling-pointer, introduced by #164991:

test_aoti_abi_check/CMakeFiles/test_aoti_abi_check.dir/test_headeronlyarrayref.cpp.o -MF test_aoti_abi_check/CMakeFiles/test_aoti_abi_check.dir/test_headeronlyarrayref.cpp.o.d -o test_aoti_abi_check/CMakeFiles/test_aoti_abi_check.dir/test_headeronlyarrayref.cpp.o -c /var/lib/jenkins/workspace/test/cpp/aoti_abi_check/test_headeronlyarrayref.cpp
In file included from /usr/include/c++/13/bits/stl_uninitialized.h:63,
                 from /usr/include/c++/13/memory:69,
                 from /var/lib/jenkins/workspace/third_party/googletest/googletest/include/gtest/gtest.h:55,
                 from /var/lib/jenkins/workspace/test/cpp/aoti_abi_check/test_headeronlyarrayref.cpp:1:
In static member function ‘static _Up* std::__copy_move<_IsMove, true, std::random_access_iterator_tag>::__copy_m(_Tp*, _Tp*, _Up*) [with _Tp = const int; _Up = int; bool _IsMove = false]’,
    inlined from ‘_OI std::__copy_move_a2(_II, _II, _OI) [with bool _IsMove = false; _II = const int*; _OI = int*]’ at /usr/include/c++/13/bits/stl_algobase.h:506:30,
    inlined from ‘_OI std::__copy_move_a1(_II, _II, _OI) [with bool _IsMove = false; _II = const int*; _OI = int*]’ at /usr/include/c++/13/bits/stl_algobase.h:533:42,
    inlined from ‘_OI std::__copy_move_a(_II, _II, _OI) [with bool _IsMove = false; _II = const int*; _OI = int*]’ at /usr/include/c++/13/bits/stl_algobase.h:540:31,
    inlined from ‘_OI std::copy(_II, _II, _OI) [with _II = const int*; _OI = int*]’ at /usr/include/c++/13/bits/stl_algobase.h:633:7,
    inlined from ‘static _ForwardIterator std::__uninitialized_copy<true>::__uninit_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const int*; _ForwardIterator = int*]’ at /usr/include/c++/13/bits/stl_uninitialized.h:147:27,
    inlined from ‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = const int*; _ForwardIterator = int*]’ at /usr/include/c++/13/bits/stl_uninitialized.h:185:15,
    inlined from ‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, allocator<_Tp>&) [with _InputIterator = const int*; _ForwardIterator = int*; _Tp = int]’ at /usr/include/c++/13/bits/stl_uninitialized.h:373:37,
    inlined from ‘void std::vector<_Tp, _Alloc>::_M_range_initialize(_ForwardIterator, _ForwardIterator, std::forward_iterator_tag) [with _ForwardIterator = const int*; _Tp = int; _Alloc = std::allocator<int>]’ at /usr/include/c++/13/bits/stl_vector.h:1692:33,
    inlined from ‘std::vector<_Tp, _Alloc>::vector(_InputIterator, _InputIterator, const allocator_type&) [with _InputIterator = const int*; <template-parameter-2-2> = void; _Tp = int; _Alloc = std::allocator<int>]’ at /usr/include/c++/13/bits/stl_vector.h:708:23,
    inlined from ‘std::vector<T> c10::HeaderOnlyArrayRef<T>::vec() const [with T = int]’ at /var/lib/jenkins/workspace/torch/headeronly/util/HeaderOnlyArrayRef.h:236:64,
    inlined from ‘virtual void TestHeaderOnlyArrayRef_TestFromInitializerList_Test::TestBody()’ at /var/lib/jenkins/workspace/test/cpp/aoti_abi_check/test_headeronlyarrayref.cpp:39:26:
/usr/include/c++/13/bits/stl_algobase.h:437:30: error: using a dangling pointer to an unnamed temporary [-Werror=dangling-pointer=]
  437 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/lib/jenkins/workspace/test/cpp/aoti_abi_check/test_headeronlyarrayref.cpp: In member function ‘virtual void TestHeaderOnlyArrayRef_TestFromInitializerList_Test::TestBody()’:
/var/lib/jenkins/workspace/test/cpp/aoti_abi_check/test_headeronlyarrayref.cpp:38:52: note: unnamed temporary defined here
   38 |   HeaderOnlyArrayRef<int> arr({1, 2, 3, 4, 5, 6, 7});
      |                                                    ^
cc1plus: all warnings being treated as errors

@fadara01 added the "ciflow/trunk" label Nov 5, 2025
[ghstack-poisoned]
@fadara01
Collaborator Author

fadara01 commented Nov 6, 2025

@pytorchbot merge --ignore-current "operator benchmarks are flakey"

@pytorch-bot

pytorch-bot Bot commented Nov 6, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: operator benchmarks are flakey

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.
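
The error is consistent with `--ignore-current` being a plain flag, while `-f` takes a message, as in the earlier `merge -f "..."` command. A minimal argparse sketch of that behavior, assuming a simplified parser; it is not the bot's actual implementation, and the long option name `--force` and the helper names are assumptions.

```python
import argparse
import shlex
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a simplified, hypothetical stand-in for the bot's merge command parser."""
    parser = argparse.ArgumentParser(prog="@pytorchbot", add_help=False)
    commands = parser.add_subparsers(dest="command", required=True)
    merge = commands.add_parser("merge", add_help=False)
    # -f consumes exactly one argument: the justification message.
    merge.add_argument("-f", "--force", metavar="MESSAGE")
    # -i/--ignore-current is a boolean switch and consumes nothing.
    merge.add_argument("-i", "--ignore-current", action="store_true")
    return parser


def parse_comment(comment: str) -> Tuple[argparse.Namespace, List[str]]:
    """Parse a bot comment, returning the parsed options and any unrecognized arguments."""
    tokens = shlex.split(comment)
    if not tokens or tokens[0] != "@pytorchbot":
        raise ValueError("comment does not address @pytorchbot")
    # parse_known_args collects leftovers instead of exiting, so they can be inspected.
    return build_parser().parse_known_args(tokens[1:])
```

With this parser, a quoted message after `--ignore-current` ends up in the leftover list, which a strict `parse_args` call would report as unrecognized arguments.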

@fadara01
Collaborator Author

fadara01 commented Nov 6, 2025

@pytorchbot merge --ignore-current

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: operator_benchmark / x86-opbenchmark-test / test (cpu_operator_benchmark_short, 1, 1, linux.12xlarge), operator_benchmark / aarch64-opbenchmark-test / test (cpu_operator_benchmark_short, 1, 1, linux.arm64.m8g.4xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025
This is needed because manylinux uses GCC-13 since #152825
As a result of the current compiler version mismatches, we've seen tests
passing jammy-aarch64 pre-commit CI, but failing for wheels built in manylinux
Related to: #166736

ghstack-source-id: a4d92fc
Pull-Request: pytorch/pytorch#166849
@github-actions github-actions Bot deleted the gh/fadara01/7/head branch December 7, 2025 02:20
Labels

ciflow/inductor
ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
module: ci (Related to continuous integration)
module: cpu (CPU specific problem (e.g., perf, algorithm))
open source
topic: not user facing (topic category)

7 participants