
Added philox based RNG context for HPU device in Dtensor scenarios#156581

Closed
pralay-das wants to merge 4 commits into pytorch:main from pralay-das:feat/dtensor_rng_hpu

Conversation

@pralay-das
Contributor

@pralay-das pralay-das commented Jun 23, 2025

In this PR, we enable HPU device-specific function calls for random operations. These calls manage setting and unsetting the random number generator (RNG) context.
While HPU devices typically use a Mersenne Twister-based RNG, DTensor-specific random operations employ an offset-based (Philox) RNG tracker that is currently integrated only with CUDA.
To integrate a similar offset-based RNG tracker into the HPU backend, a backend-specific device handle function is needed so that these random operations can identify their execution context.
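The offset-based idea can be illustrated outside PyTorch with NumPy's Philox bit generator: every rank shares one seed, and each rank advances the counter by a rank-specific offset, so shards draw from disjoint slices of a single logical random stream. This is only a conceptual sketch; `shard_stream`, `rank`, and `draws_per_rank` are made-up names for illustration, not DTensor or HPU APIs.

```python
# Sketch of offset-based (Philox) RNG coordination: same seed everywhere,
# per-rank counter offsets. Uses NumPy's Philox purely for illustration.
from numpy.random import Generator, Philox

SEED = 1234
draws_per_rank = 8  # hypothetical: how many values each shard consumes

def shard_stream(rank: int, seed: int = SEED) -> Generator:
    """Return a Generator positioned at this rank's offset in the shared stream."""
    bg = Philox(key=seed)
    bg.advance(rank * draws_per_rank)  # offset-based: skip ahead, don't reseed
    return Generator(bg)

# Two "ranks" share the seed but use different offsets, so values differ...
vals0 = shard_stream(rank=0).random(draws_per_rank)
vals1 = shard_stream(rank=1).random(draws_per_rank)
assert not (vals0 == vals1).any()

# ...and the same (seed, offset) pair is fully reproducible.
assert (shard_stream(rank=1).random(draws_per_rank) == vals1).all()
```

The appeal of a counter-based generator like Philox is exactly this: a stream position is just a (seed, offset) pair, so distributed shards can coordinate by agreeing on offsets instead of exchanging generator state.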

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot

pytorch-bot bot commented Jun 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156581

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6fcbadf with merge base d1b4e0f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 23, 2025
@pralay-das
Contributor Author

@pytorchbot label "topic: not user facing"

@pralay-das
Contributor Author

Hi @zhangxiaoli73, @wconstab, could you review this PR?

Collaborator

@wanchaol wanchaol left a comment


Please see the inlined comments. I think:

  • It would be better for the different devices to align on how random seeds are used.
  • We should be very careful about any modification to _dispatch.py, as it would add runtime CPU overhead to every operator. If you really want to do this, it should be done in the random module instead.

@pralay-das pralay-das requested a review from wanchaol July 1, 2025 08:39
Collaborator

@wanchaol wanchaol left a comment


sgtm, I wonder if there's a way for you to add some tests to test hpu..

@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 7, 2025
@jeromean

jeromean commented Jul 8, 2025

sgtm, I wonder if there's a way for you to add some tests to test hpu..

@wanchaol, in addition to internal validation, we are currently collaborating with "accelerator-integration-wg" to enable out-of-tree accelerators and that could aid in validating these sorts of scenarios.

@pralay-das
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 8, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.


Labels

  • ciflow/trunk - Trigger trunk jobs on your pull request
  • Merged
  • oncall: distributed - Add this issue/PR to distributed oncall triage queue
  • open source
  • topic: not user facing - topic category
  • triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


6 participants