Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying#154369
ccchow wants to merge 8 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154369
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 48d0a0e with merge base 7cda401. This comment was automatically generated by Dr. CI and updates every 15 minutes.
weifengpy left a comment:
Thanks for the fix! Are you interested in adding a unit test to cover this case? If not, I can follow up after your PR lands.
Sure, I'll add a UT for this.
Added a unit test validating single-GPU writeback for NO_SHARD configurations.
Hi @weifengpy, I added a case to test_fsdp_flatten_params.py. Please help review; it passed locally.
Added comments.
Also confirmed that this fix resolves the repro in the original issue #151223.
@pytorchmergebot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Removed the trailing spaces reported by lintrunner. @weifengpy, could you kick off the merge again?
Gentle ping.
Thanks for the reminder. Triggering CI again.
Ran `lintrunner -a torch/distributed/fsdp/_flat_param.py` to resolve the check errors.
Thanks. Approved; running CI again.
Not sure what's wrong with the checks, but they probably need a restart.
@pytorchmergebot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 4 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Seems it's still blocked because the PR workflow was cancelled.
I'm wondering how to resolve the pending workflows to unblock this PR.
Kicking it off again. If it fails, would it be possible to rebase on the latest trunk? The branch is 22 days old. If it succeeds, no need to rebase.
Thanks @weifengpy. Rebased my branch; it should be ready to merge.
Running CI again.
@pytorchmergebot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those, or skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
Hi @weifengpy, the last CI run passed 100%, but the PR is still blocked with the error 'Merging is blocked'.
@pytorchmergebot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Thanks @weifengpy for the help on this PR!
Appreciate your persistence in pushing this to the very last mile, @ccchow!
Fixes #151223
Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.
Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded, maintaining the expected 1-D shape for writeback operations.
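The idea can be sketched as follows. This is a minimal, self-contained illustration of the flatten-before-copy fix; the `writeback` helper below is hypothetical and not the actual `_flat_param.py` code, which handles many more cases:

```python
import torch

def writeback(flat_param: torch.Tensor, src_param: torch.Tensor,
              uses_sharded_strategy: bool) -> None:
    """Hypothetical sketch of the writeback fix: with NO_SHARD the
    destination is the full 1-D flattened buffer, so a modified
    (possibly multi-dimensional) source parameter must be flattened
    before copy_(); otherwise copy_ raises a shape mismatch."""
    src = src_param.detach()
    if not uses_sharded_strategy:
        # NO_SHARD case: flatten the source to match the 1-D flat param.
        src = src.flatten()
    flat_param.copy_(src)

# A 2x3 parameter written back into a 1-D flat buffer of 6 elements.
flat = torch.zeros(6)
new_param = torch.arange(6, dtype=torch.float32).reshape(2, 3)
writeback(flat, new_param, uses_sharded_strategy=False)
print(flat.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

Without the `flatten()`, `flat_param.copy_(src)` would fail because a `(2, 3)` source cannot be broadcast into a `(6,)` destination, which matches the shape mismatch reported in #151223.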
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k