
[RelEng] Define BUILD_BUNDLE_PTXAS#119750

Closed
malfet wants to merge 3 commits into main from malfet/bundle-and-use-ptxas

Conversation

Contributor

@malfet malfet commented Feb 13, 2024

This would bundle ptxas into a `bin` folder.

When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest (CUDA-12.3 at the time of the PyTorch-2.2 release) ptxas.

Needs pytorch/builder@5c814e2 to produce valid binary builds

Test plan:

  • Create a dummy ptxas in the torch/bin folder and observe torch.compile fail with a backtrace in the Triton module.
  • Run the following script (to be added to binary tests) against the CUDA-11.8 wheel:
```python
import torch
import triton

@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)

x = torch.rand(3, 3, device="cuda")
print(foo(x))

# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch built with {cuda_version}, but Triton uses ptxas {ptxas_version}"
```
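Where the Triton helper above is unavailable, the same check can be done by shelling out to ptxas directly. A hedged sketch, assuming `ptxas --version` output of the usual form "Cuda compilation tools, release 11.8, V11.8.89" (the helper names here are illustrative, not from the PR):

```python
import re
import subprocess

def parse_ptxas_release(version_text: str) -> str:
    """Extract the 'release X.Y' CUDA version from `ptxas --version` output."""
    m = re.search(r"release (\d+\.\d+)", version_text)
    if m is None:
        raise RuntimeError(f"cannot parse ptxas version from: {version_text!r}")
    return m.group(1)

def ptxas_release(ptxas_path: str = "ptxas") -> str:
    """Run `ptxas --version` and return the CUDA release it reports."""
    out = subprocess.run([ptxas_path, "--version"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ptxas_release(out)
```

The parsed release (e.g. "11.8") can then be compared against `torch.version.cuda` just like in the script above.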

Fixes #119054

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@malfet malfet requested review from albanD, atalman and jansel February 13, 2024 04:07

pytorch-bot bot commented Feb 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119750

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 10b2549 with merge base 02b60e7:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD albanD left a comment


I have a couple of questions here:

  1. How do we test this?
  2. Are we allowed to redistribute this binary from nvidia?
  3. Do we care about the binary size increase? ptxas is 25MB on my local install.

@malfet
Contributor Author

malfet commented Feb 13, 2024

  1. How do we test this?

Alas, manually for now (it would be very easy if we had a system with a CUDA-11 driver in CI, but that's a different story).

  2. Are we allowed to redistribute this binary from nvidia?

Yes (Triton already does it).

  3. Do we care about the binary size increase? ptxas is 25MB on my local install.

This is supposed to be used only for the CUDA-11.8 wheel (which does not go to PyPI). Alternatively, one can do it by introducing a dependency on nvidia-cuda-nvcc-cu11, but that would bring in much more in terms of transitive dependencies.

@albanD
Collaborator

albanD commented Feb 13, 2024

This is supposed to be used only for the CUDA-11.8 wheel

Why only for that version?
Should this be done for any binary we generate where the version of cuda we build with does not match the ptxas inside the triton package we are pinned to?

@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 13, 2024
@malfet
Contributor Author

malfet commented Feb 13, 2024

  1. How do we test this?

Alas, manually for now (it would be very easy if we had a system with a CUDA-11 driver in CI, but that's a different story).

Actually, we can perhaps do it during binary builds; let me try adding a binary-build job that queries the ptxas version for _dynamo, which would be a good indicator.

@malfet
Contributor Author

malfet commented Feb 15, 2024

@pytorchbot merge -f "Binary tests are green"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge, ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


atalman pushed a commit to atalman/pytorch that referenced this pull request Feb 15, 2024
Pull Request resolved: pytorch#119750
Approved by: https://github.com/jansel, https://github.com/atalman
atalman added a commit that referenced this pull request Feb 15, 2024
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Fixes #119054
resolved: #119750
@github-actions github-actions bot deleted the malfet/bundle-and-use-ptxas branch March 17, 2024 01:51
pytorchmergebot pushed a commit that referenced this pull request Sep 30, 2025
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)

See also #163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships a CUDA 12.8 ptxas.
This PR bundles a ptxas version for CUDA 13, so that it can help #163801 when users run on new devices like THOR and Spark.

Fixes #163801

Test Plan:

- Check the binary size increase against nightly or the v2.9 RC.
- Install the binary built from this PR into a working THOR and GB200/GH100 machine (reproduce the original issue on THOR first); the issue is expected to be gone without any additional user setting. Testing on GB200 ensures no regression.
- Reference: #119750 and pytorch/builder@5c814e2

Note: with this PR, torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, Triton does not know about this new CUDA 13 ptxas. So if a user assumes torch/bin/ptxas is already present and deletes the ptxas shipped with Triton, then https://github.com/triton-lang/triton/blob/c6ad34f7eb42630533412d93ca2cc00a4b4f8f3c/python/triton/knobs.py#L216 would still complain that ptxas is not found, since Triton is unaware of the bundled copy.

Pull Request resolved: #163988
Approved by: https://github.com/atalman
pytorchbot pushed a commit that referenced this pull request Sep 30, 2025
(cherry picked from commit 3b4ad4a)
atalman pushed a commit that referenced this pull request Sep 30, 2025
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Successfully merging this pull request may close these issues.

BackendCompilerFailed: backend='inductor' raised: RuntimeError: Triton Error [CUDA]: device kernel image is invalid