
[dcp][hf] Fix multi-rank consolidation for no files to process case #160660

Closed
ankitageorge wants to merge 1 commit into pytorch:main from ankitageorge:export-D80273616

Conversation

ankitageorge (Contributor) commented Aug 14, 2025

Summary: In the consolidate_safetensors_files_on_every_rank method, which uses multiple ranks to combine sharded safetensors files, some ranks have no work to do when the world size is larger than the number of safetensors files to consolidate. This case wasn't caught in earlier testing, and an extra barrier call caused issues for the ranks that had no work: they should wait at the final barrier, just like the ranks that do have work.
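For illustration, a minimal sketch of the fixed control flow, using hypothetical names (consolidate_on_every_rank, combine_shards); the real method is consolidate_safetensors_files_on_every_rank in torch.distributed.checkpoint. Ranks with an empty mapping skip the work but still reach the final barrier, so the barrier count stays matched across the process group:

import logging

import torch.distributed as dist

logger = logging.getLogger(__name__)

def combine_shards(input_files, output_file):
    # Placeholder for the real work: read sharded safetensors
    # files and write the consolidated output file.
    pass

def consolidate_on_every_rank(filtered_mapping, rank):
    # Ranks with no assigned files must not return before the
    # collective below; an early return leaves the other ranks
    # blocked forever on a barrier that can never complete.
    if filtered_mapping:
        for input_files, output_file in filtered_mapping.items():
            combine_shards(input_files, output_file)
    else:
        logger.info("Rank %d: No files to process", rank)
    # Every rank, with or without work, synchronizes here.
    dist.barrier()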

Test Plan:
Tested this case end-to-end on a job.
Added a unit test.
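A rough sketch of the shape such a unit test could take (assumed, not the actual test): spawn more ranks than there are files and check that the run completes instead of hanging, reusing the consolidate_on_every_rank sketch above.

import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank, world_size):
    dist.init_process_group(
        "gloo",
        init_method="tcp://127.0.0.1:29501",
        rank=rank,
        world_size=world_size,
    )
    # Only rank 0 gets a file; the other ranks have nothing to do.
    mapping = {("shard-0.safetensors",): "model.safetensors"} if rank == 0 else {}
    consolidate_on_every_rank(mapping, rank)  # sketch defined above
    dist.destroy_process_group()

if __name__ == "__main__":
    # 4 ranks, 1 file: before the fix the 3 idle ranks hit a mismatched
    # barrier; after the fix all ranks reach the final barrier and exit.
    mp.spawn(_worker, args=(4,), nprocs=4)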

Rollback Plan:

Differential Revision: D80273616

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

pytorch-bot bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160660

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 47ff1ff with merge base 5665dc9:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and release notes: distributed (checkpoint) labels Aug 14, 2025
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D80273616

if not filtered_mapping:
    logger.info("Rank %d: No files to process, exiting early", rank)

Contributor

We may want to keep this log line for debugging purposes.

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 21, 2025
ankitageorge (Contributor, Author)

@pytorchmergebot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Details for Dev Infra team. Raised by workflow job.

ankitageorge (Contributor, Author)

@pytorchmergebot merge -f

pytorch-bot bot commented Aug 21, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot merge: error: argument -f/--force: expected one argument

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Try @pytorchbot --help for more info.

ankitageorge (Contributor, Author)

@pytorchmergebot merge -f forcing because failure was a timeout which resulted in cancellation

pytorch-bot bot commented Aug 21, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: because failure was a timeout which resulted in cancellation

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

ankitageorge (Contributor, Author)

@pytorchmergebot merge -f 'forcing because failure was a timeout which resulted in cancellation'

pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
[dcp][hf] Fix multi-rank consolidation for no files to process case (pytorch#160660)


Pull Request resolved: pytorch#160660
Approved by: https://github.com/sibuachu

Labels

ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (checkpoint)


4 participants