[CUDA] Use runtime driver API for cuStreamWriteValue32 #158295

Closed
eee4017 wants to merge 20 commits into pytorch:main from eee4017:main-fix-symmetric-memory-api-call-new

Conversation

@eee4017
Collaborator

@eee4017 eee4017 commented Jul 14, 2025

@eee4017 eee4017 requested review from eqy and syed-ahmed as code owners July 14, 2025 23:28
@pytorch-bot

pytorch-bot bot commented Jul 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158295

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit c4a63b8 with merge base da05b7f (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 14, 2025
@huydhn
Contributor

huydhn commented Jul 14, 2025

I have this PR to add the older driver check to CI #158300. We probably want to merge that first, then rebase this PR on top?

@ngimel ngimel added this to the 2.8.0 milestone Jul 15, 2025
@ngimel ngimel added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 15, 2025
@pytorch-bot

pytorch-bot bot commented Jul 15, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jul 15, 2025
@ngimel ngimel added ciflow/trunk Trigger trunk jobs on your pull request ciflow/h100-distributed labels Jul 15, 2025
@pytorch-bot

pytorch-bot bot commented Jul 15, 2025

To add the ciflow label ciflow/h100-distributed please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot

pytorch-bot bot commented Jul 15, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/h100-distributed ciflow/trunk Trigger trunk jobs on your pull request labels Jul 15, 2025
@ngimel
Collaborator

ngimel commented Jul 15, 2025

@huydhn Since you've manually checked this PR on an older driver, I think it's ok to merge it

@eqy
Collaborator

eqy commented Jul 15, 2025

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased main-fix-symmetric-memory-api-call-new onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout main-fix-symmetric-memory-api-call-new && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the main-fix-symmetric-memory-api-call-new branch from dcdc640 to 68d505b Compare July 15, 2025 17:50
@huydhn
Contributor

huydhn commented Jul 16, 2025

@pytorchbot rebase -b main

@pytorchmergebot pytorchmergebot force-pushed the main-fix-symmetric-memory-api-call-new branch from 95066e0 to c4a63b8 Compare July 16, 2025 17:49
@pytorch-bot pytorch-bot bot removed the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jul 16, 2025
@huydhn huydhn added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jul 16, 2025
@huydhn
Contributor

huydhn commented Jul 16, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 16, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / build

Details for Dev Infra team Raised by workflow job

@ngimel
Collaborator

ngimel commented Jul 16, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 4, 8, lf.linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck)

Details for Dev Infra team Raised by workflow job

@ngimel ngimel removed the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jul 16, 2025
@ngimel
Collaborator

ngimel commented Jul 16, 2025

@pytorchbot merge -i

@atalman
Contributor

atalman commented Jul 17, 2025

@pytorchbot cherry-pick --onto release/2.8 -c critical

pytorchbot pushed a commit that referenced this pull request Jul 17, 2025
Reopen #156097

Fixes #154073

Reference: NVIDIA/Fuser#4197

See PR #156097 and #154097

Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
(cherry picked from commit a9f902a)
@pytorchbot
Collaborator

Cherry picking #158295

The cherry pick PR is at #158585 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

atalman pushed a commit that referenced this pull request Jul 18, 2025
[CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)

Reopen #156097

Fixes #154073

Reference: NVIDIA/Fuser#4197

See PR #156097 and #154097

Pull Request resolved: #158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn


(cherry picked from commit a9f902a)

Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
tvukovic-amd pushed a commit to ROCm/pytorch that referenced this pull request Aug 20, 2025
[CUDA] Use runtime driver API for cuStreamWriteValue32 (pytorch#158295)

Reopen pytorch#156097

Fixes pytorch#154073

Reference: NVIDIA/Fuser#4197

See PR pytorch#156097 and pytorch#154097

Pull Request resolved: pytorch#158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn


(cherry picked from commit a9f902a)

Co-authored-by: Frank Lin <eee4017@gmail.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
test-config/default
topic: not user facing (topic category)

Development

Successfully merging this pull request may close these issues.

RuntimeError: CUDA driver error: operation not supported with test_stream_write_value32 and cuStreamWriteValue32

9 participants