Fix AArch64 segfaults by disabling strict-aliasing in GridSamplerKernel for GCC 12 and above #158117

Closed
robert-hardwick wants to merge 1 commit into pytorch:main from robert-hardwick:gcc_segfault

Conversation

@robert-hardwick
Collaborator

@robert-hardwick robert-hardwick commented Jul 11, 2025

This PR disables GCC's strict-aliasing optimization in GridSamplerKernel on all AArch64 CPUs when building with GCC 12 or later.

Pull Request #152825 upgraded GCC from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy GCC version has not been updated yet).

We identified that the problem also exists in GCC 12, hence the __GNUC__ >= 12 check.
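A version guard of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's actual diff: the macro name `SKETCH_STRICT_ALIASING_DISABLED` and the helper `strict_aliasing_disabled()` are hypothetical and exist only so the effect of the guard can be observed. `#pragma GCC optimize("no-strict-aliasing")` is GCC's documented way to apply `-fno-strict-aliasing` to the functions that follow it in a translation unit.

```cpp
// Sketch only: the macro and helper names are hypothetical, not from the PR.
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 12)
// GCC 12 or newer targeting AArch64: turn off type-based alias analysis
// for every function defined after this point in the translation unit.
#pragma GCC optimize("no-strict-aliasing")
#define SKETCH_STRICT_ALIASING_DISABLED 1
#else
// Any other compiler, version, or architecture: leave optimizations untouched.
#define SKETCH_STRICT_ALIASING_DISABLED 0
#endif

// Reports whether the guard above was active for this build.
constexpr bool strict_aliasing_disabled() {
  return SKETCH_STRICT_ALIASING_DISABLED != 0;
}
```

Scoping the pragma to a single source file keeps strict-aliasing optimizations enabled for the rest of the build, at the cost of possibly slower code in the affected kernel.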

Fixes #157626

Fixes these test failures when PyTorch is built with GCC 12 and above:

test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

@robert-hardwick added the module: arm and arm priority labels Jul 11, 2025
@pytorch-bot

pytorch-bot Bot commented Jul 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158117

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit b62f4d0 with merge base 4283d96 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the module: cpu label Jul 11, 2025
@robert-hardwick added the topic: not user facing and ciflow/linux-aarch64 labels and removed the open source label Jul 11, 2025
@robert-hardwick
Collaborator Author

Note: there isn't a CI workflow that will correctly validate this fix, since #157748 is still in progress and is not for AArch64 anyway. I suggest manually installing a built wheel and running the 5 tests above to validate.

@robert-hardwick
Collaborator Author

Note: this needs to go into 2.8.0 release as it fixes a regression

@Skylion007
Collaborator

> Note: this needs to go into 2.8.0 release as it fixes a regression

Do we have an upper bound after which strict aliasing starts working again?

@robert-hardwick
Collaborator Author

robert-hardwick commented Jul 11, 2025

> > Note: this needs to go into 2.8.0 release as it fixes a regression
>
> Do we have an upper bound after which strict aliasing starts working again?

At this moment I don't know if there is an upper bound, but I can quickly check whether GCC 15 is affected (so far I know that 12, 13 and 14 are). The problem is that I can't be sure whether this is a compiler bug or whether there is some undefined behaviour in GridSamplerKernel, which is unlikely but can't be ruled out.
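For context on what such undefined behaviour would look like in general, here is a minimal generic sketch of a strict-aliasing violation and its well-defined alternative. It is not code from GridSamplerKernel, and the helper name `float_bits` is made up for illustration. It assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Well-defined type punning: copy the object representation into an integer.
// Assumes float is 32 bits wide (checked at compile time below).
inline std::uint32_t float_bits(float value) {
  static_assert(sizeof(std::uint32_t) == sizeof(float), "expects a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Undefined behaviour under -fstrict-aliasing, shown for contrast only:
//   std::uint32_t bits = *reinterpret_cast<const std::uint32_t*>(&value);
// With strict aliasing enabled, the optimizer may assume that an access
// through a uint32_t pointer never touches a float object, and it may
// reorder or drop loads and stores on that basis.
```

If code of the second kind existed in the kernel, building with `-fno-strict-aliasing` would mask it rather than fix it, which is why the two explanations are hard to tell apart from the symptoms alone.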

@robert-hardwick
Collaborator Author

On second thoughts, gcc-15 isn't available in the manylinux yum repository yet (https://raw.repo.almalinux.org/almalinux/8/AppStream/aarch64/os/Packages/), so I can't check quickly :(

@malfet
Contributor

malfet commented Jul 11, 2025

@robert-hardwick do you mind enabling tests for this?

@malfet malfet added this to the 2.8.0 milestone Jul 11, 2025
@robert-hardwick
Collaborator Author

> @robert-hardwick do you mind enabling tests for this?

Not sure I follow. Could you clarify what you mean here? The failing tests are already enabled in linux-aarch64.yml.

@robert-hardwick
Collaborator Author

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Jul 15, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@robert-hardwick
Collaborator Author

Landing in main so that a cherry-pick PR can be created for release 2.8.0 branch cut.

Labels

arm priority, ciflow/linux-aarch64, ciflow/trunk, Merged, module: arm, module: cpu, open source, topic: not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Segmentation faults in test_ops.py tests with gcc13 on AArch64 (v1)

5 participants