[CUDA] Use runtime driver API for cuStreamWriteValue32#156097

Closed
eee4017 wants to merge 20 commits into pytorch:main from eee4017:main-fix-symmetric-memory-api-call

Conversation

@eee4017
Collaborator

@eee4017 eee4017 commented Jun 16, 2025

Fixes #154073

Reference: NVIDIA/Fuser#4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

@eee4017 eee4017 requested review from eqy and syed-ahmed as code owners June 16, 2025 16:02
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 16, 2025
@pytorch-bot

pytorch-bot bot commented Jun 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156097

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit dabccdb with merge base 7f14b42 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy eqy added the ciflow/trunk label Jun 16, 2025
@eqy eqy requested a review from ngimel June 16, 2025 16:45
ngimel
ngimel previously approved these changes Jun 16, 2025
Collaborator

@ngimel ngimel left a comment

Please fix lint

@eee4017 eee4017 force-pushed the main-fix-symmetric-memory-api-call branch from 96693fe to ad432d9 Compare June 17, 2025 05:40
@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Jun 17, 2025
@Aidyn-A Aidyn-A added the ciflow/trunk label Jun 17, 2025
cyyever
cyyever previously approved these changes Jun 17, 2025
@eee4017
Collaborator Author

eee4017 commented Jun 17, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@atalman
Contributor

atalman commented Jun 18, 2025

@pytorchmergebot revert -c nosiganl -m "break internal tests"

@pytorch-bot

pytorch-bot bot commented Jun 18, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'nosiganl' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@atalman
Contributor

atalman commented Jun 18, 2025

@pytorchmergebot revert -c nosignal -m "break internal tests"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@eee4017 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jun 18, 2025
@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td labels Jun 18, 2025
@pytorch-bot pytorch-bot bot dismissed ngimel’s stale review June 18, 2025 21:48

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

#undef LOOKUP_LIBCUDA_ENTRY
#define LOOKUP_LIBCUDA_ENTRY_WITH_VERSION_OPTIONAL(name, version) \
  r.name##_ = reinterpret_cast<decltype(&name)>(get_symbol(#name, version));
C10_LIBCUDA_DRIVER_API_OPTIONAL(LOOKUP_LIBCUDA_ENTRY_WITH_VERSION_OPTIONAL)
Collaborator Author

The assertion was removed to allow compatibility with older drivers.
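
To illustrate the idea behind this comment, here is a minimal sketch of an optional symbol lookup, under the assumption that the removed assertion required the looked-up symbol to be non-null. Everything in it (`SymbolTable`, `DriverApi`, `load_api`, `call_write_value`, `fake_write_value`, and the two driver tables) is hypothetical and only stands in for a real driver library; it is not PyTorch's actual driver API code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical signature standing in for a driver entry point.
using WriteValueFn = int (*)(unsigned int);

// Hypothetical implementation that a "new" driver would export.
int fake_write_value(unsigned int v) { return static_cast<int>(v); }

// Hypothetical symbol table playing the role of the loaded driver library.
using SymbolTable = std::unordered_map<std::string, void*>;

// A "new" driver exports the symbol; an "old" driver does not.
SymbolTable new_driver() {
  return {{"write_value", reinterpret_cast<void*>(&fake_write_value)}};
}
SymbolTable old_driver() { return {}; }

// Returns nullptr when the symbol is missing; it never asserts or throws.
void* get_symbol(const SymbolTable& driver, const std::string& name) {
  auto it = driver.find(name);
  return it == driver.end() ? nullptr : it->second;
}

struct DriverApi {
  WriteValueFn write_value_ = nullptr;  // optional entry, may stay null
};

// Loading succeeds even if the optional symbol is absent.
DriverApi load_api(const SymbolTable& driver) {
  DriverApi api;
  api.write_value_ =
      reinterpret_cast<WriteValueFn>(get_symbol(driver, "write_value"));
  return api;
}

// The availability check is deferred to the point of use.
int call_write_value(const DriverApi& api, unsigned int v) {
  if (api.write_value_ == nullptr) {
    throw std::runtime_error("write_value is not supported by this driver");
  }
  return api.write_value_(v);
}
```

With this shape, a process running on an older driver can still load the API table and only fails, with a clear error, if it actually calls the missing entry point.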

Contributor

Thank you for the update! Let me do a round of testing on Monday to confirm.

Contributor

I reran the manual test with the latest change from the PR: https://github.com/pytorch/pytorch/actions/runs/16276707651/job/45957270665?pr=158181

Contributor

Although it's still running, I think the test looks good now because:

  1. It correctly uses the older NVIDIA driver: https://github.com/pytorch/pytorch/actions/runs/16276707651/job/45959318683?pr=158181#step:13:764
  2. Tests are running OK instead of crashing.

@huydhn
Contributor

huydhn commented Jul 14, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@ngimel
Collaborator

ngimel commented Jul 14, 2025

@eee4017 sorry, can you please create a new pull request?

@eee4017
Collaborator Author

eee4017 commented Jul 14, 2025

Hi, it was created here: #158295

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@eee4017
Collaborator Author

eee4017 commented Sep 16, 2025

Closed with #158585

Labels

ci-no-td (Do not run TD on this PR)
ciflow/h100-symm-mem
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (c10d) (release notes category)
Reverted
Stale
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Development

Successfully merging this pull request may close these issues.

RuntimeError: CUDA driver error: operation not supported with test_stream_write_value32 and cuStreamWriteValue32