
Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying#154369

Closed
ccchow wants to merge 8 commits into pytorch:main from ccchow:main

Conversation

@ccchow (Contributor) commented May 26, 2025

Fixes #151223

Because FSDP stores the original parameters as views into a flattened tensor, replacing a parameter's tensor directly can desynchronize those views. With the NO_SHARD strategy this caused a shape-mismatch error when writing back modified parameters.

This change ensures writeback handles NO_SHARD correctly by flattening tensors before copying: the writeback logic now flattens the source parameter or gradient when the strategy is unsharded, maintaining the expected 1-D shape for the copy.
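The failure mode can be sketched with plain tensors standing in for FSDP's flat parameter (a minimal illustration only; the variable names below are hypothetical and not FSDP's actual internals):

```python
import torch

# FSDP keeps original parameters as views into one flat 1-D tensor,
# so writing a modified parameter back must target a 1-D slice of it.
flat_param = torch.zeros(6)
param_view = flat_param.view(2, 3)  # stand-in for an "original parameter" view

# New values produced by the user, still in the original (2, 3) shape.
src = torch.arange(6, dtype=torch.float32).view(2, 3)

# Copying src directly into the 1-D slice raises a shape-mismatch error,
# which is the failure mode this PR addresses for NO_SHARD:
#   flat_param[:6].copy_(src)  # -> RuntimeError
# Flattening the source first keeps the expected 1-D shape for writeback.
flat_param[:6].copy_(src.flatten())

assert torch.equal(param_view, src)  # the view now reflects the written-back values
```

In this sketch, flattening the source before `copy_` is what makes the copy shape-compatible with the 1-D flat parameter, mirroring the approach the fix takes for the unsharded case.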

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot bot commented May 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154369

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48d0a0e with merge base 7cda401:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla bot commented May 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels May 26, 2025
@ccchow ccchow changed the title Fix: Ensured writeback handles NO_SHARD correctly by flattening tensors before copying Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying May 26, 2025
@bdhirsh bdhirsh requested a review from weifengpy May 28, 2025 21:21
@bdhirsh bdhirsh added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 28, 2025
@weifengpy (Contributor) left a comment

Thanks for the fix! Are you interested in adding a unit test to cover this case? If not, I can follow up after you land this.

@ccchow (Contributor, Author) commented May 29, 2025 via email

Added a unit test validating single-GPU writeback for NO_SHARD configurations.
@ccchow (Contributor, Author) commented May 30, 2025

Hi @weifengpy, I added a case to test_fsdp_flatten_params.py. Please help review; the test passes locally.

@ccchow (Contributor, Author) commented May 30, 2025

Also confirmed that this fix resolves the repro in the original issue #151223.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 1, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jun 1, 2025
@ccchow (Contributor, Author) commented Jun 1, 2025

Removed trailing spaces reported by lintrunner. @weifengpy, could you help kick off the merge again?

@ccchow (Contributor, Author) commented Jun 5, 2025

Kindly ping.

@weifengpy (Contributor)

Kindly ping.

Thanks for the reminder. Triggering CI again.

@ccchow (Contributor, Author) commented Jun 5, 2025

Ran lintrunner -a torch/distributed/fsdp/_flat_param.py to resolve check errors.

@weifengpy (Contributor) commented Jun 5, 2025

Ran lintrunner -a torch/distributed/fsdp/_flat_param.py to resolve check errors.

Thanks. Approved; running CI again.

@ccchow (Contributor, Author) commented Jun 6, 2025

Not sure what's wrong with the checks, but they probably need a restart.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 7, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 4 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@ccchow (Contributor, Author) commented Jun 8, 2025

It seems the PR is still blocked because its workflow was cancelled.

@ccchow (Contributor, Author) commented Jun 17, 2025

I'm wondering how to resolve the pending workflows to unblock this PR.

@weifengpy (Contributor)

I'm wondering how to resolve the pending workflows to unblock this PR.

Kicking it off again.

If it fails, would it be possible to rebase on the latest trunk? It's 22 days old. If it succeeds, there's no need to rebase.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jun 22, 2025
@ccchow (Contributor, Author) commented Jul 1, 2025

Thanks @weifengpy. Rebased my branch; it should be ready to merge.

@weifengpy (Contributor)

Thanks @weifengpy . Rebased my branch and should be ready to merge.

running CI again

@ccchow (Contributor, Author) commented Jul 6, 2025

@pytorchmergebot merge

@pytorch-bot bot commented Jul 6, 2025

The pull workflow has not been scheduled for the PR yet. This could be because the author doesn't have permission to run those workflows, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@ccchow (Contributor, Author) commented Jul 6, 2025

Hi @weifengpy, the last CI run passed 100%, but the PR is still blocked with the error 'Merging is blocked: You're not authorized to push to this branch.' I rebased again and appended a merge command. Feel free to trigger CI again and let me know if anything is needed.

@weifengpy (Contributor)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 6, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ccchow (Contributor, Author) commented Jul 8, 2025

Thanks @weifengpy for the help on this PR!

@weifengpy (Contributor)

Appreciate your persistence in pushing this to the very last mile, @ccchow!


Labels

ciflow/trunk: Trigger trunk jobs on your pull request
Merged
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (fsdp): release notes category
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

[FSDP] Cannot writeback when the parameter shape changes

5 participants