
Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying#154369

Closed
ccchow wants to merge 8 commits into pytorch:main from ccchow:main

Conversation

@ccchow (Contributor) commented May 26, 2025

Fixes #151223

Because FSDP stores the original parameters as views into a flattened tensor, replacing a parameter's tensor directly can desynchronize those views. With the NO_SHARD strategy this caused a shape-mismatch error when writing back modified parameters.

This change ensures writeback handles NO_SHARD correctly by flattening tensors before copying: the writeback logic now flattens the source parameter or gradient when the strategy is unsharded, maintaining the expected 1-D shape for the copy.
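The failure mode can be sketched with plain tensors standing in for FSDP's flat parameter (a minimal illustration only; the variable names below are hypothetical and not FSDP's actual internals):

```python
import torch

# FSDP keeps original parameters as views into one flat 1-D tensor,
# so writing a modified parameter back must target a 1-D slice of it.
flat_param = torch.zeros(6)
param_view = flat_param.view(2, 3)  # stand-in for an "original parameter" view

# New values produced by the user, still in the original (2, 3) shape.
src = torch.arange(6, dtype=torch.float32).view(2, 3)

# Copying src directly into the 1-D slice raises a shape-mismatch error,
# which is the failure mode this PR addresses for NO_SHARD:
#   flat_param[:6].copy_(src)  # -> RuntimeError
# Flattening the source first keeps the expected 1-D shape for writeback.
flat_param[:6].copy_(src.flatten())

assert torch.equal(param_view, src)  # the view now reflects the written-back values
```

In this sketch, flattening the source before `copy_` is what makes the copy shape-compatible with the 1-D flat parameter, mirroring the approach the fix takes for the unsharded case.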

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot bot commented May 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154369

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48d0a0e with merge base 7cda401:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla bot commented May 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels May 26, 2025
@ccchow ccchow changed the title Fix: Ensured writeback handles NO_SHARD correctly by flattening tensors before copying Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying May 26, 2025
@bdhirsh bdhirsh requested a review from weifengpy May 28, 2025 21:21
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 28, 2025
@weifengpy (Contributor) left a comment

Thanks for the fix! Are you interested in adding a unit test to cover this case? If not, I can follow up after you land this.

@ccchow (Contributor, Author) commented May 29, 2025 via email

Added a unit test validating single-GPU writeback for NO_SHARD configurations.
@ccchow (Contributor, Author) commented May 30, 2025

Hi @weifengpy, I added a case to test_fsdp_flatten_params.py. Please help review; the test passes locally.

@ccchow (Contributor, Author) commented May 30, 2025

Also confirmed that this fix resolves the repro in the original issue #151223.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 1, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jun 1, 2025
@ccchow (Contributor, Author) commented Jun 1, 2025

Removed trailing spaces reported by lintrunner. @weifengpy, could you help kick off the merge again?

@ccchow (Contributor, Author) commented Jun 5, 2025

Kindly ping.

@weifengpy (Contributor)

Kindly ping.

Thanks for the reminder. Triggering CI again.

@ccchow (Contributor, Author) commented Jun 5, 2025

Ran lintrunner -a torch/distributed/fsdp/_flat_param.py to resolve check errors.

@weifengpy (Contributor) commented Jun 5, 2025

Ran lintrunner -a torch/distributed/fsdp/_flat_param.py to resolve check errors.

Thanks. Approved; running CI again.

@ccchow (Contributor, Author) commented Jun 6, 2025

Not sure what's wrong with the checks, but they probably need a restart.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 7, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 4 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@ccchow (Contributor, Author) commented Jun 8, 2025

It seems the PR is still blocked because its workflow was cancelled.

@ccchow (Contributor, Author) commented Jun 17, 2025

I'm wondering how to resolve the pending workflows to unblock this PR.

@weifengpy (Contributor)

I'm wondering how to resolve the pending workflows to unblock this PR.

Kicking it off again.

If it fails, would it be possible to rebase on the latest trunk? It's 22 days old. If it succeeds, there's no need to rebase.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jun 22, 2025
@ccchow (Contributor, Author) commented Jul 1, 2025

Thanks @weifengpy. Rebased my branch; it should be ready to merge.

@weifengpy (Contributor)

Thanks @weifengpy . Rebased my branch and should be ready to merge.

running CI again

@ccchow (Contributor, Author) commented Jul 6, 2025

@pytorchmergebot merge

@pytorch-bot bot commented Jul 6, 2025

The pull workflow has not been scheduled for the PR yet. This could be because the author doesn't have permission to run those workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@ccchow (Contributor, Author) commented Jul 6, 2025

Hi @weifengpy, the last CI run passed 100%, but the PR is still blocked with the error 'Merging is blocked: You're not authorized to push to this branch.' I rebased again and appended a merge command. Feel free to trigger CI again and let me know if anything is needed.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 6, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ccchow (Contributor, Author) commented Jul 8, 2025

Thanks @weifengpy for the help on this PR!

@weifengpy (Contributor)

Appreciate your persistence in pushing this to the very last mile, @ccchow!


Labels

ciflow/trunk: Trigger trunk jobs on your pull request
Merged
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (fsdp): release notes category
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

[FSDP] Cannot writeback when the parameter shape changes

5 participants