[DCP][OSS] Rank local checkpointing in DCP without collectives#147758

Closed
saumishr wants to merge 1 commit into pytorch:main from saumishr:export-D70112642

Conversation

@saumishr
Contributor

@saumishr saumishr commented Feb 24, 2025

Summary:
DCP metadata collectives become prohibitively expensive as job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off, for now, is the loss of dedupe and re-sharding; support for these will be added in a follow-up.
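To illustrate the idea (this is a simplified sketch, not the actual DCP implementation; the function names and the `.metadata.{rank}` file layout are hypothetical): each rank writes and later reads its own metadata file, so no cross-rank collective is needed at save or load time.

```python
# Hypothetical sketch of rank-local checkpointing: every rank persists its
# own metadata file and reads only that file back, so save/load requires no
# cross-rank coordination. Names and file layout are illustrative only.
import json
import os
import tempfile


def save_rank_local(checkpoint_dir: str, rank: int, state: dict) -> None:
    """Each rank writes its own metadata; no collective is involved."""
    with open(os.path.join(checkpoint_dir, f".metadata.{rank}"), "w") as f:
        json.dump({"rank": rank, "keys": sorted(state)}, f)


def load_rank_local(checkpoint_dir: str, rank: int) -> dict:
    """Each rank reads back only its own metadata file."""
    with open(os.path.join(checkpoint_dir, f".metadata.{rank}")) as f:
        return json.load(f)


# Simulate 4 ranks in a single process: each touches only its own file.
with tempfile.TemporaryDirectory() as ckpt_dir:
    for r in range(4):
        save_rank_local(ckpt_dir, r, {"layer.weight": r})
    meta = load_rank_local(ckpt_dir, 2)

print(meta)
```

The cost of dropping the collective is that no rank ever sees the global view of the checkpoint, which is exactly why dedupe and re-sharding are deferred.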

Differential Revision: D70112642

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @LucasLLC @pradeepfn @kwen2501 @c-p-i-o @MeetVadakkanchery @mhorowitz @ekr0

@pytorch-bot

pytorch-bot bot commented Feb 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147758

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 147fb0e with merge base 8eee08d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels Feb 24, 2025
@meetv18 meetv18 added the oncall: distributed checkpointing label Feb 25, 2025
gkroiz added a commit to gkroiz/pytorch that referenced this pull request Mar 9, 2025
@saumishr saumishr force-pushed the export-D70112642 branch 2 times, most recently from 36ee9c7 to eb8d1a7 on April 2, 2025
@saumishr saumishr force-pushed the export-D70112642 branch 2 times, most recently from 78dcad0 to d3c31ff on April 6, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70112642

@pytorch pytorch deleted a comment from facebook-github-bot Apr 6, 2025
saumishr added a commit to saumishr/tnt that referenced this pull request Apr 18, 2025
Summary:

X-link: pytorch/pytorch#147758



Differential Revision: D70112642
saumishr added a commit to saumishr/tnt that referenced this pull request Apr 20, 2025
Summary:
Pull Request resolved: meta-pytorch#991

X-link: pytorch/pytorch#147758


Differential Revision: D70112642
pytorch-bot bot pushed a commit that referenced this pull request Apr 24, 2025
Summary:
X-link: meta-pytorch/tnt#991




Test Plan: E2E UTs

Differential Revision: D70112642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70112642

11 similar comments
Contributor

@meetv18 meetv18 left a comment

Overall the save path LGTM, but I have some concerns about the load path. Happy to approve once they're resolved. Thanks!

Comment on lines +228 to +246
nonlocal use_collectives
nonlocal metadata

if "kwargs" in inspect.signature(storage_reader.read_metadata).parameters:
    try:
        metadata = storage_reader.read_metadata(rank=distW.rank)  # noqa: F841
        if metadata:
            use_collectives = False
            logger.info(
                "Rank local metadata is found. Using no rank coordination for checkpoint loading."
            )
    except Exception:
        logger.info(
            "Rank local metadata is not found. Falling back to global metadata."
        )

if use_collectives:
    metadata = storage_reader.read_metadata()
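A minimal, self-contained sketch of the backward-compatibility pattern in the snippet above (the reader classes here are hypothetical stand-ins, not DCP's real `StorageReader` implementations): inspect the reader's `read_metadata` signature and only pass `rank=` when the method can accept it, falling back to the global call otherwise.

```python
# Sketch of the signature-inspection fallback used in the PR snippet.
# LegacyReader/RankAwareReader are made-up examples, not DCP classes.
import inspect


class LegacyReader:
    def read_metadata(self):  # old signature: no per-rank support
        return "global-metadata"


class RankAwareReader:
    def read_metadata(self, **kwargs):  # accepts rank=... via **kwargs
        rank = kwargs.get("rank")
        return f"rank-{rank}-metadata" if rank is not None else "global-metadata"


def read_for_rank(reader, rank):
    """Return (metadata, use_collectives) using the PR's fallback logic."""
    params = inspect.signature(reader.read_metadata).parameters
    if "kwargs" in params:  # same check as in the snippet above
        try:
            meta = reader.read_metadata(rank=rank)
            if meta:
                return meta, False  # rank-local metadata found: no collectives
        except Exception:
            pass  # rank-local metadata missing: fall back to global below
    return reader.read_metadata(), True  # use_collectives stays True


print(read_for_rank(LegacyReader(), 3))
print(read_for_rank(RankAwareReader(), 3))
```

Note the check keys on a literal `**kwargs` parameter name; a reader that declared `rank` as an explicit keyword argument would need a slightly different test.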
Contributor

I am a bit unsure about this. Consider a user's implementation where every rank reads the global metadata from storage but still has some global planning to do, e.g. to assign read items to only one rank and then have it broadcast to the others. Wouldn't this fundamentally break that logic, making this change non-backward-compatible?

Contributor Author

I believe backward compatibility within the DCP API covers its own behavior. Currently it checks for both: if rank-local metadata is present, it assumes the no-collective mode is on; if not, it falls back to the global metadata. If no metadata is found at all, it raises the same exception it does today. A user with a more complex interaction model will need to handle backward compatibility in their own storage components as well. The read_metadata API lets callers specify a rank to read either rank-local or global metadata, so users can customize the behavior through it.

Contributor

Oh yes that makes sense to me. L245 is doing exactly that, thank you for the clarification!

 # Check whether combined chunks cover the whole tensor
 tensor_volume = reduce(operator.mul, value.size, 1)
-if chunks_volume != tensor_volume:
+if len(global_plan) > 1 and chunks_volume != tensor_volume:
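To illustrate why the check is gated on `len(global_plan) > 1` (a standalone sketch with made-up shapes, not DCP's actual planner code): the summed chunk volume only matches the full tensor volume when chunks from all ranks are present; under rank-local planning each rank sees just its own chunk, so the comparison would spuriously fail.

```python
# Illustrative volume check: chunk volumes sum to the tensor volume only in
# the global-planning case, which is why the check is skipped when a rank
# plans alone. All shapes and plan names here are made up for the example.
import operator
from functools import reduce

tensor_size = (4, 6)                                   # full (global) tensor shape
tensor_volume = reduce(operator.mul, tensor_size, 1)   # 4 * 6 = 24

# Global planning: two ranks each cover one half of the tensor.
rank_chunks = [(2, 6), (2, 6)]
chunks_volume = sum(reduce(operator.mul, c, 1) for c in rank_chunks)

global_plan = ["plan_rank0", "plan_rank1"]             # plans from 2 ranks
if len(global_plan) > 1 and chunks_volume != tensor_volume:
    raise ValueError("chunks do not cover the tensor")

# Rank-local planning: a rank sees only its own plan (len == 1), so its
# chunk volume (12) legitimately differs from the global volume (24) and
# the coverage check must be skipped.
local_plan = ["plan_rank0"]
local_volume = reduce(operator.mul, rank_chunks[0], 1)
assert not (len(local_plan) > 1 and local_volume != tensor_volume)

print(tensor_volume, chunks_volume, local_volume)
```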
Contributor

n00b q: why do we need this?

Contributor Author

The tensor volume check makes sense only in a global context. When every rank does its own planning, the check doesn't add much value. I plan to refactor plan validation into local and global validation, which will make this cleaner.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70112642

4 similar comments
[DCP][OSS] Rank local checkpointing in DCP without collectives (pytorch#147758)

Summary:
X-link: meta-pytorch/tnt#991




Test Plan:
E2E UTs

Save and load test with internal DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-lv5d7qcfmnqzkd

Save and load test with OSS DCP components: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-z1vz46vkkgtcld
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-textray-pretrain_mlm-njvvbn07rv5ckd

Reviewed By: meetv18

Differential Revision: D70112642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70112642

@saumishr
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!


@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels

ciflow/h100-distributed, ciflow/trunk, fb-exported, Merged, oncall: distributed checkpointing, oncall: distributed, release notes: distributed (checkpoint)
