[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 #163988
nWEIdia wants to merge 2 commits into pytorch:main
Conversation
…UDA13 Wheel Build. See also pytorch#163972
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163988
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 Unrelated Failures) As of commit 1a3aea4 with merge base 5880996. FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Binary size information: (this PR's artifact, python3.12)
Check functionality: (e.g. on THOR)
Torch Compile now expects ptxas to be: /usr/local/lib/python3.12/dist-packages/torch/_inductor/bin/ptxas
I would just change the expected directory to be /usr/local/lib/python3.12/dist-packages/torch/bin/ptxas again to reduce packaging risks.
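For reference, a quick way to check which of the two candidate locations actually ships ptxas in an installed wheel (a verification sketch based on the paths in this thread, not code from the PR):

```python
# Verification sketch: check the two candidate ptxas locations discussed above.
import os
import subprocess

import torch

torch_dir = os.path.dirname(torch.__file__)
candidates = [
    os.path.join(torch_dir, "_inductor", "bin", "ptxas"),  # location torch.compile originally expected
    os.path.join(torch_dir, "bin", "ptxas"),               # location proposed in this thread
]
for path in candidates:
    if os.path.isfile(path):
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        print(path, "->", result.stdout.strip() or result.stderr.strip())
    else:
        print(path, "-> not present")
```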
Test Results on THOR with the latest wheels:
gh run download 18053880730 -n manywheel-py3_12-cuda-aarch64-13_0
root@:/workspace/pytorch# python test/inductor/test_control_flow.py CondTests.test_cond_mismatched_branch_output_size_device_cuda_dynamic_False
On the other device: warnings.warn(
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot cherry-pick --onto release/2.9 --fixes "Critical CI fix" -c critical
…63988) See also #163972, which was intended to be this PR. Pull Request resolved: #163988 Approved by: https://github.com/atalman (cherry picked from commit 3b4ad4a)
Cherry picking #163988: The cherry pick PR is at #164236 and it is linked with issue Critical CI fix. The following tracker issues are updated. Details for Dev Infra team: Raised by workflow job
…64236) [AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988) Pull Request resolved: #163988 Approved by: https://github.com/atalman (cherry picked from commit 3b4ad4a) Co-authored-by: Wei Wang <weiwan@nvidia.com>
…4716) The ptxas bundling was introduced in #163988 to work around issues users may face due to #163801. Fortunately, on the triton upstream side, triton-lang/triton@884fdae finally landed, which means #163801 is permanently fixed. In addition, pytorch's triton commit pin has been updated via #178821. We can now roll back #163988. In between, we unified the arm sbsa build with x86, so a plain revert won't work; the export is reverted manually instead.

Test plan: download the wheels and check the binary size to confirm 1) ptxas is gone from both x86 and sbsa (even though it was initially added only to sbsa cu13) and 2) the unit tests that ran on #163988 still pass.

Pull Request resolved: #174716 Approved by: https://github.com/tinglvv, https://github.com/atalman
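One way to do that check on a downloaded wheel without installing it (an illustrative sketch; pass the wheel path as the first argument):

```python
# Inspect a wheel for a bundled ptxas without installing it.
import sys
import zipfile

with zipfile.ZipFile(sys.argv[1]) as wheel:
    hits = [name for name in wheel.namelist() if name.endswith("bin/ptxas")]

print(hits if hits else "no bundled ptxas in this wheel")
```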
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships a CUDA 12.8 ptxas.
This PR bundles a CUDA 13 ptxas, so that it can help with #163801 when users run on new devices like THOR and Spark.
Fixes #163801
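As a rough illustration of what BUILD_BUNDLE_PTXAS=1 means at build time (a hypothetical sketch; the actual setup.py logic, paths, and guards may differ):

```python
# Hypothetical sketch of the bundling step enabled by BUILD_BUNDLE_PTXAS=1.
# Assumptions: CUDA_HOME and the torch/bin destination are illustrative, not exact.
import os
import shutil

if os.environ.get("BUILD_BUNDLE_PTXAS") == "1":
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    src = os.path.join(cuda_home, "bin", "ptxas")  # CUDA 13 ptxas from the build toolchain
    dst = os.path.join("torch", "bin", "ptxas")    # shipped inside the wheel
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)
```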
Test Plan:
Check the binary size increase against nightly or v2.9 RC (see the size-comparison sketch below).
Install a current binary on a working THOR and GB200/GH100 machine (reproducing the original issue first on THOR), then install the binary built from this PR; the issue should be gone without any additional user setup. Testing on GB200 ensures no regression.
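For the size check, comparing the two wheel files directly is enough (a sketch; pass the nightly/RC wheel and this PR's wheel as arguments):

```python
# Compare wheel sizes: baseline (nightly or v2.9 RC) vs. this PR's artifact.
import os
import sys


def size_mib(path: str) -> float:
    return os.path.getsize(path) / 2**20


baseline, candidate = sys.argv[1], sys.argv[2]
delta = size_mib(candidate) - size_mib(baseline)
print(f"baseline:  {size_mib(baseline):.1f} MiB")
print(f"candidate: {size_mib(candidate):.1f} MiB (delta {delta:+.1f} MiB)")
```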
Reference: #119750 and pytorch/builder@5c814e2
Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, the Triton world does not know that this new CUDA 13 ptxas exists. So if a user, seeing that torch/bin/ptxas is already present, deletes the ptxas shipped with Triton, then https://github.com/triton-lang/triton/blob/c6ad34f7eb42630533412d93ca2cc00a4b4f8f3c/python/triton/knobs.py#L216 would still complain that ptxas was not found; Triton has no knowledge of the newly bundled copy.
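Mechanically, the hook described above amounts to something like the following (a behavioral sketch, not the exact source; it assumes the bundled binary ships at torch/bin/ptxas as discussed in this thread, and relies on Triton honoring the TRITON_PTXAS_PATH environment variable):

```python
# Behavioral sketch of _set_triton_ptxas_path (assumption: bundled binary at torch/bin/ptxas).
import os


def _set_triton_ptxas_path() -> None:
    """Point Triton at the bundled CUDA 13 ptxas via TRITON_PTXAS_PATH."""
    if os.environ.get("TRITON_PTXAS_PATH"):
        return  # respect an explicit user override
    import torch

    ptxas = os.path.join(os.path.dirname(torch.__file__), "bin", "ptxas")
    if os.path.isfile(ptxas) and os.access(ptxas, os.X_OK):
        os.environ["TRITON_PTXAS_PATH"] = ptxas
```

Anything that compiles Triton kernels without going through this hook keeps using whichever ptxas Triton discovers on its own, which is exactly the caveat above.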
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @ptrblck @eqy @tinglvv @atalman @malfet