🐛 Describe the bug
We have identified 5 tests that fail with a segmentation fault on AArch64 Neoverse-V1. (Neoverse-V2, i.e. AWS c8g, seems to be unaffected.)
How we identified this: our workflow runs the unit tests against a manywheel build, whereas the linux-aarch64.yml workflow builds inside a jammy container. This is why these tests are currently passing in CI.
How to reproduce
This can be reproduced consistently with a nightly build. You will need a Neoverse-V1 machine (e.g. AWS c7g). First install the PyTorch nightly, then run any of these tests.
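Since the failure appears only on Neoverse-V1, it may help to confirm the core type before running the tests. The sketch below is not part of the original report: it assumes a Linux host exposing `/proc/cpuinfo`, and the helper name `detect_neoverse` is hypothetical. It relies on the Arm part numbers 0xd40 (Neoverse-V1) and 0xd4f (Neoverse-V2) under implementer 0x41.

```python
from pathlib import Path

# "CPU part" values reported for Arm (implementer 0x41) cores.
# Only the two cores discussed in this issue are listed.
ARM_CPU_PARTS = {
    0xD40: "neoverse-v1",
    0xD4F: "neoverse-v2",
}


def detect_neoverse(cpuinfo_text: str) -> str:
    """Return 'neoverse-v1', 'neoverse-v2' or 'unknown' (hypothetical helper)."""
    implementer = None
    part = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        # Only the first core is inspected; mixed-core systems are not handled.
        if key == "CPU implementer" and implementer is None:
            implementer = int(value, 16)
        elif key == "CPU part" and part is None:
            part = int(value, 16)
    if implementer != 0x41 or part is None:
        return "unknown"
    return ARM_CPU_PARTS.get(part, "unknown")


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print(detect_neoverse(text))
```

On a c7g instance this would be expected to report `neoverse-v1`; any other result suggests the machine is not suitable for reproducing the crash.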
Cause
We have identified the cause to be PR #152825, which was merged about 2 months ago.
I have confirmed that reverting this patch makes all of the above tests pass again. This also explains why CI is currently passing: the PR did not simultaneously upgrade jammy to gcc13. AFAIK in linux-aarch64.yml the build is executed in a jammy container, not the manylinux container.
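Because the explanation depends on which compiler produced a given build, a quick check of the installed wheel may be useful. The following is a minimal sketch, not something taken from the PR: it assumes the string returned by `torch.__config__.show()` contains a line of the form `GCC <major>.<minor>`, which may not hold for every build, and the helper name `gcc_major_version` is hypothetical.

```python
import re
from typing import Optional


def gcc_major_version(build_config: str) -> Optional[int]:
    """Extract the GCC major version from a build-config string.

    Assumes a 'GCC <major>.<minor>' entry is present; returns None otherwise.
    """
    match = re.search(r"\bGCC\s+(\d+)(?:\.\d+)*", build_config)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        # torch.__config__.show() returns the build configuration as text.
        print(gcc_major_version(torch.__config__.show()))
```

Comparing the value reported by a manywheel nightly with the one reported by a build from the jammy container would show whether the two really differ in compiler version.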
Next Steps
There are a few possible resolutions we could take here
Versions
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01