
[RelEng] Define BUILD_BUNDLE_PTXAS#119750

Closed
malfet wants to merge 3 commits into main from malfet/bundle-and-use-ptxas

Conversation

Contributor

@malfet malfet commented Feb 13, 2024

This would bundle ptxas into a `bin` folder.

When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest (CUDA-12.3 at the time of the PyTorch-2.2 release) ptxas.

Needs pytorch/builder@5c814e2 to produce valid binary builds

Test plan:

  • Create a dummy ptxas in the torch/bin folder and observe torch.compile fail with a backtrace in the Triton module.
  • Run the following script (to be added to binary tests) against the CUDA-11.8 wheel:
```python
import torch
import triton

@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)

x = torch.rand(3, 3, device="cuda")
print(foo(x))

# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch built with {cuda_version}, but Triton uses ptxas {ptxas_version}"
```
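Where the Triton helper above is unavailable, the same check can be done by shelling out to ptxas directly. A hedged sketch, assuming `ptxas --version` output of the usual form "Cuda compilation tools, release 11.8, V11.8.89" (the helper names here are illustrative, not from the PR):

```python
import re
import subprocess

def parse_ptxas_release(version_text: str) -> str:
    """Extract the 'release X.Y' CUDA version from `ptxas --version` output."""
    m = re.search(r"release (\d+\.\d+)", version_text)
    if m is None:
        raise RuntimeError(f"cannot parse ptxas version from: {version_text!r}")
    return m.group(1)

def ptxas_release(ptxas_path: str = "ptxas") -> str:
    """Run `ptxas --version` and return the CUDA release it reports."""
    out = subprocess.run([ptxas_path, "--version"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ptxas_release(out)
```

The parsed release (e.g. "11.8") can then be compared against `torch.version.cuda` just like in the script above.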

Fixes #119054

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@malfet malfet requested review from albanD, atalman and jansel February 13, 2024 04:07

pytorch-bot bot commented Feb 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119750

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 10b2549 with merge base 02b60e7:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD albanD left a comment


I have a couple of questions here:

  1. How do we test this?
  2. Are we allowed to redistribute this binary from nvidia?
  3. Do we care about the binary size increase? ptxas is 25MB on my local install.

@malfet
Contributor Author

malfet commented Feb 13, 2024

  1. How do we test this?

Alas, manually for now (it would be very easy if we had a system with a CUDA-11 driver in CI, but that's a different story).

  2. Are we allowed to redistribute this binary from nvidia?

Yes (Triton already does it).

  3. Do we care about the binary size increase? ptxas is 25MB on my local install.

This is supposed to be used only for the CUDA-11.8 wheel (which does not go to PyPI). Alternatively, one can do it by introducing a dependency on nvidia-cuda-nvcc-cu11, but that would bring in much more in terms of transitive dependencies.

@albanD
Collaborator

albanD commented Feb 13, 2024

This is supposed to be used only for the CUDA-11.8 wheel

Why only for that version?
Should this be done for any binary we generate where the version of cuda we build with does not match the ptxas inside the triton package we are pinned to?

@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 13, 2024
@malfet
Contributor Author

malfet commented Feb 13, 2024

  1. How do we test this?

Alas, manually for now (it would be very easy if we had a system with a CUDA-11 driver in CI, but that's a different story).

Actually, we can perhaps do it during binary builds; let me try adding a binary-build job that queries the ptxas version for _dynamo, which would be a good indicator.

@malfet
Contributor Author

malfet commented Feb 15, 2024

@pytorchbot merge -f "Binary tests are green"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge, ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


atalman pushed a commit to atalman/pytorch that referenced this pull request Feb 15, 2024
Pull Request resolved: pytorch#119750
Approved by: https://github.com/jansel, https://github.com/atalman
atalman added a commit that referenced this pull request Feb 15, 2024
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Fixes #119054
resolved: #119750
@github-actions github-actions bot deleted the malfet/bundle-and-use-ptxas branch March 17, 2024 01:51
pytorchmergebot pushed a commit that referenced this pull request Sep 30, 2025
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)

See also #163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships a CUDA 12.8 ptxas.
This PR bundles a ptxas version for CUDA 13, so that it can help #163801 when users run on new devices like THOR and Spark.

Fixes #163801

Test Plan:

- Check the binary size increase against nightly or the v2.9 RC.
- Install the binary built from this PR into a working THOR and GB200/GH100 machine (reproduce the original issue on THOR first); the issue is expected to be gone without any additional user setting. Testing on GB200 ensures no regression.
- Reference: #119750 and pytorch/builder@5c814e2

Note: with this PR, torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, Triton does not know about this new CUDA 13 ptxas. So if a user assumes torch/bin/ptxas is already present and deletes the ptxas shipped with Triton, then https://github.com/triton-lang/triton/blob/c6ad34f7eb42630533412d93ca2cc00a4b4f8f3c/python/triton/knobs.py#L216 would still complain that ptxas is not found, since Triton is unaware of the bundled copy.

Pull Request resolved: #163988
Approved by: https://github.com/atalman
pytorchbot pushed a commit that referenced this pull request Sep 30, 2025
(cherry picked from commit 3b4ad4a)
atalman pushed a commit that referenced this pull request Sep 30, 2025
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Successfully merging this pull request may close these issues.

BackendCompilerFailed: backend='inductor' raised: RuntimeError: Triton Error [CUDA]: device kernel image is invalid