[BugFix] Call synchronization when using the td.to("cpu") operation on third-party devices to avoid potential precision issues#1425

Merged
vmoens merged 2 commits into pytorch:main from ji-huazhong:fix
Sep 2, 2025

Conversation

Contributor

@ji-huazhong ji-huazhong commented Aug 26, 2025

Description

Following up on this comment, this PR addresses the precision issue that arises when moving a tensordict from a third-party device to the CPU.

cc @vmoens

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #15213 if this solves the issue #15213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 26, 2025
Collaborator

@vmoens vmoens left a comment


Thanks for this PR!
Can you further explain why this is needed and what problem is solved?
Thanks!

Comment thread tensordict/base.py
Contributor Author

ji-huazhong commented Aug 26, 2025

Thanks for this PR! Can you further explain why this is needed and what problem is solved? Thanks!

Hi @vmoens, we have recently been using VERL for post-training our model on Ascend NPUs. We observed that as training progresses, the gradient norm (grad norm) consistently diverges. In contrast, all metrics converge well when training on NVIDIA GPUs.

Coincidentally, we found that adding synchronization (torch.npu.synchronize) at the connection points between the different components of the RL workflow resolved this issue, and it no longer recurred.

In VERL, a single controller distributes data (organized as TensorDicts) to various workers (each associated with a device) via Ray for computation. After the computation completes, td.to("cpu") is called to move the data from the device to the CPU (see here). The data is then aggregated back to the single controller through Ray, which proceeds to execute the next phase of computation. It was precisely by adding synchronization after td.to("cpu") that the gradient-norm divergence ceased to occur.

Through further investigation, we discovered that TensorDict's to operation is non-blocking by default. To compensate, when the target device is specified as to('cpu'), the _sync_all method is invoked to guarantee that the device-to-CPU transfer has completed before the data is used. See:

tensordict/tensordict/base.py

Lines 14130 to 14136 in 5e78151

if (
    device is not None
    and sub_non_blocking
    and not non_blocking
    and device.type != "cuda"
):
    self._sync_all()

tensordict/tensordict/base.py

Lines 14260 to 14268 in 5e78151

def _sync_all(self):
    if self._has_cuda:
        # TODO: dynamo doesn't like torch.cuda.is_initialized
        if not is_compiling() and torch.cuda.is_initialized():
            torch.cuda.synchronize()
    elif self._has_mps:
        mps = getattr(torch, "mps", None)
        if mps is not None:
            mps.synchronize()

However, this guarantee only covers CUDA and MPS devices. PyTorch also supports third-party hardware such as Intel XPU and Ascend NPU, and this PR extends the guarantee to those platforms as well.

@ji-huazhong ji-huazhong changed the title [BugFix] consistent use of non_blocking in tensordict and torch.Tensor on non-cuda devices [BugFix] Call synchronization when using the td.to("cpu") operation on third-party devices to avoid potential precision issues Aug 26, 2025
@vmoens vmoens added the bug Something isn't working label Sep 2, 2025
Collaborator

@vmoens vmoens left a comment


LGTM I like it!

@vmoens vmoens merged commit 4c1766e into pytorch:main Sep 2, 2025
56 of 79 checks passed
@ji-huazhong ji-huazhong deleted the fix branch September 3, 2025 08:02
@FightingZhen

Great to see this PR merged! When will the next version containing this feature be released? @vmoens

Collaborator

vmoens commented Sep 3, 2025

Hoping to release this by end of week but I have a bunch of CI issues...
