[BugFix] Call synchronization when using the td.to("cpu") operation on third-party devices to avoid potential precision issues#1425

Merged
vmoens merged 2 commits into pytorch:main from ji-huazhong:fix
Sep 2, 2025

Conversation

Contributor

@ji-huazhong ji-huazhong commented Aug 26, 2025

Description

Following up on this comment, this PR addresses the precision issue that arises when moving a tensordict from a third-party device to the CPU.

cc @vmoens

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #15213 if this solves the issue #15213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 26, 2025
Collaborator

@vmoens vmoens left a comment


Thanks for this PR!
Can you further explain why this is needed and what problem is solved?
Thanks!

Comment thread tensordict/base.py
Contributor Author

ji-huazhong commented Aug 26, 2025

Thanks for this PR! Can you further explain why this is needed and what problem is solved? Thanks!

Hi @vmoens, we have recently been using VERL for post-training our model on Ascend NPUs. We observed that as training progresses, the gradient norm (grad norm) consistently diverges. In contrast, all metrics converge well when training on NVIDIA GPUs.

Coincidentally, we found that adding synchronization (torch.npu.synchronize) at the connection points between the different components of the RL workflow resolved this issue, and it no longer recurred.

In VERL, a single controller distributes data (organized as TensorDicts) to various workers (each associated with a device) via Ray for computation. After the computation completes, td.to("cpu") is called to move the data from the device to the CPU (see here). The data is then aggregated back to the single controller through Ray, which proceeds to execute the next phase of computation. It was precisely by adding synchronization after td.to("cpu") that the gradient-norm divergence ceased to occur.

Through further investigation, we discovered that TensorDict's to operation is non-blocking by default. To compensate, when the target device is specified as to('cpu'), the _sync_all method is invoked to guarantee that the device-to-CPU transfer has completed before the data is used. See:

tensordict/tensordict/base.py

Lines 14130 to 14136 in 5e78151

if (
    device is not None
    and sub_non_blocking
    and not non_blocking
    and device.type != "cuda"
):
    self._sync_all()

tensordict/tensordict/base.py

Lines 14260 to 14268 in 5e78151

def _sync_all(self):
    if self._has_cuda:
        # TODO: dynamo doesn't like torch.cuda.is_initialized
        if not is_compiling() and torch.cuda.is_initialized():
            torch.cuda.synchronize()
    elif self._has_mps:
        mps = getattr(torch, "mps", None)
        if mps is not None:
            mps.synchronize()

However, this guarantee only covers CUDA and MPS devices. PyTorch also supports third-party hardware such as Intel XPU and Ascend NPU, and this PR extends the guarantee to those platforms as well.

@ji-huazhong ji-huazhong changed the title [BugFix] consistent use of non_blocking in tensordict and torch.Tensor on non-cuda devices [BugFix] Call synchronization when using the td.to("cpu") operation on third-party devices to avoid potential precision issues Aug 26, 2025
@vmoens vmoens added the bug Something isn't working label Sep 2, 2025
Collaborator

@vmoens vmoens left a comment


LGTM I like it!

@vmoens vmoens merged commit 4c1766e into pytorch:main Sep 2, 2025
56 of 79 checks passed
@ji-huazhong ji-huazhong deleted the fix branch September 3, 2025 08:02
@FightingZhen

Great to see this PR merged! When will the next version containing this feature be released? @vmoens

Collaborator

vmoens commented Sep 3, 2025

Hoping to release this by end of week but I have a bunch of CI issues...
