[dcp][hf] Fix multi-rank consolidation for no files to process case #160660
ankitageorge wants to merge 1 commit into pytorch:main
Conversation
Summary: If there are more ranks in the world size than there are safetensors files to consolidate, then some ranks have no work to do. When I had tested, this case wasn't caught, and there was an extra barrier call, causing issues for the ranks that had no work to do. They should wait at the end, as the ranks with work do. Test Plan: tested this case on a job e2e; added a unit test. Rollback Plan: Differential Revision: D80273616
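The failure mode described in the summary can be sketched with a plain-Python stand-in. Here `threading.Barrier` plays the role of `torch.distributed.barrier()`, and the `consolidate` function with its round-robin file assignment is purely illustrative, not the actual DCP implementation: the point is that ranks with an empty file mapping must still reach the collective barrier instead of returning before it, otherwise the working ranks hang.

```python
# Hypothetical sketch of the fix: every rank, including ranks with no
# files assigned, reaches the barrier exactly once.
import threading

def consolidate(rank, world_size, num_files, barrier, results):
    # Round-robin file assignment; with num_files < world_size some
    # ranks receive an empty mapping.
    my_files = [f for f in range(num_files) if f % world_size == rank]
    if not my_files:
        # Fixed behavior: a rank with no work still waits at the
        # barrier rather than exiting before it.
        barrier.wait()
        results[rank] = "idle"
        return
    results[rank] = f"processed {len(my_files)} file(s)"
    barrier.wait()

world_size, num_files = 4, 2  # more ranks than files to consolidate
barrier = threading.Barrier(world_size)
results = {}
threads = [
    threading.Thread(target=consolidate, args=(r, world_size, num_files, barrier, results))
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

If the `barrier.wait()` inside the empty-mapping branch is removed (the pre-fix behavior), the barrier is never satisfied and the ranks with work block forever.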
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160660
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 Cancelled Job. As of commit 47ff1ff with merge base 5665dc9: the following job was cancelled and should be retried.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D80273616
Review comment on the early-exit check:

```python
if not filtered_mapping:
    logger.info("Rank %d: No files to process, exiting early", rank)
```

We may want to keep this log line for debugging purposes.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, the first of them being: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Raised by workflow job.
@pytorchmergebot merge -f

❌ 🤖 pytorchbot command failed: Try
@pytorchmergebot merge -f forcing because failure was a timeout which resulted in cancellation

❌ 🤖 pytorchbot command failed: Try
@pytorchmergebot merge -f 'forcing because failure was a timeout which resulted in cancellation'
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ytorch#160660) Pull Request resolved: pytorch#160660 Approved by: https://github.com/sibuachu
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, then some ranks have no work to do. When I had tested, this case wasn't caught, and there was an extra barrier call, causing issues for the ranks that had no work to do. They should wait at the end, as the ranks with work do.
Test Plan:
tested this case on a job e2e; added a unit test
Rollback Plan:
Differential Revision: D80273616
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta