[DCP] OSS Zero Overhead Checkpointing Implementation#156207

Closed
Saiteja64 wants to merge 1 commit into main from export-D72391401
Conversation


@Saiteja64 Saiteja64 commented Jun 17, 2025

Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287

Differential Revision: D72391401

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k


pytorch-bot bot commented Jun 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156207

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 040df63 with merge base 2eb744c:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint) labels Jun 17, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D72391401

@vadimkantorov

Does async DCP also work / is it practical for sending checkpoints directly to S3? (I guess using https://github.com/awslabs/s3-connector-for-pytorch? Or maybe there is some native support?)

@fegin previously requested changes Jun 17, 2025


A general question: what if users mix async_save() with save()? Does this PR handle that case?


Should we have a unittest for this file if possible?


See #155192; this is now added to torch.cuda. I think you can remove these changes, as that PR has also updated state_dict_utils.


Probably not a good idea to ask users to selectively call close(); I would suggest that users always call close(). Also, if this is a public API, the docstring should follow the template; you can check the other docstrings.


Should we also wait for the last async_save inside this API as well?


I think we should just leave that to the users. In general, I want to limit global training state within DCP as much as possible.
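The caller-side pattern implied here (the training loop, not DCP, tracks the outstanding save) can be sketched in plain Python. This is a toy stand-in using a thread pool; `fake_async_save` is illustrative and not a DCP API:

```python
from concurrent.futures import Future, ThreadPoolExecutor
import time

# Hypothetical stand-in for dcp.async_save: kicks off an upload in the
# background and returns a Future that resolves when it completes.
_pool = ThreadPoolExecutor(max_workers=1)

def fake_async_save(step: int) -> Future:
    def upload():
        time.sleep(0.01)  # simulate I/O
        return f"checkpoint-{step}"
    return _pool.submit(upload)

# The training loop owns the previous future and waits on it before
# starting the next save, so DCP itself keeps no global state about
# outstanding checkpoints.
prev_future = None
saved = []
for step in range(3):
    if prev_future is not None:
        saved.append(prev_future.result())  # block only on the *previous* save
    prev_future = fake_async_save(step)
saved.append(prev_future.result())
```

The key point is that waiting for the last async_save is an application-level decision, which matches the author's preference to keep global training state out of DCP.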


Can we cache the state in AsyncStager and let the user manage the lifetime of the async stager? This would allow users to init the async stager and destroy it as needed, either on every checkpoint or at the end of the job, as they see fit.

As it stands, the close() method is out of context: it is not tied to any resource/object, so it is hard for users to understand what close means and why they have to call it.
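A minimal sketch of the object-scoped lifetime being suggested, in plain Python; `CachingStager` is an illustrative class, not the DCP API:

```python
from concurrent.futures import Future, ThreadPoolExecutor

class CachingStager:
    """Illustrative stager that owns its cached staging buffers and its
    worker thread, so teardown is an explicit method on the object
    rather than a free-floating module-level close()."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._cached = {}  # reused staging buffers, keyed by tensor name

    def stage(self, state_dict: dict) -> Future:
        def copy_to_cache():
            for k, v in state_dict.items():
                self._cached[k] = v  # a real impl would copy into pinned CPU memory
            return dict(self._cached)
        return self._executor.submit(copy_to_cache)

    def close(self):
        # Lifetime is user-managed: call per checkpoint or once at job end.
        self._executor.shutdown(wait=True)
        self._cached.clear()

stager = CachingStager()
staged = stager.stage({"w": [1.0, 2.0]}).result()
stager.close()
```

With this shape, close() is clearly "release this stager's cached buffers and worker", resolving the "what does close mean" concern.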


fegin commented Jun 17, 2025

@vadimkantorov It should work; the async_save logic and the underlying storage are decoupled.


@fegin left a comment

Overall, looks good. We should remove the set-logging-level call. Also, we should have at least one unit test for this feature.


It would be nice if both paths returned AsyncSaveResponse -- one path with only upload_future and the other with both. But I understand that this would break BC. Not sure if we can do some trick, like making AsyncSaveResponse inherit from Future. Just a thought; it may not work.


Yeah, it's unfortunate, but I don't think there is a clean way to do this :/
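For illustration, one shape the Future-inheritance trick floated above could take (plain Python; this `AsyncSaveResponse` is a hypothetical BC shim, not the DCP class):

```python
from concurrent.futures import Future

class AsyncSaveResponse(Future):
    """Illustrative BC shim: behaves like the upload future that old
    callers expect, while also exposing the new staging future."""

    def __init__(self, staging_future: Future, upload_future: Future):
        super().__init__()
        self.staging_future = staging_future
        self.upload_future = upload_future
        # Mirror the upload future so resp.result() keeps working.
        upload_future.add_done_callback(
            lambda f: self.set_exception(f.exception())
            if f.exception() else self.set_result(f.result())
        )

staging, upload = Future(), Future()
resp = AsyncSaveResponse(staging, upload)
staging.set_result("staged")
upload.set_result("uploaded")
```

Old call sites that do `async_save(...).result()` would keep working, while new call sites can unpack `staging_future`/`upload_future`. Whether this is worth the subtlety of subclassing Future is exactly the open question in the thread.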


We should remove this line. Users should be able to control the logging level, not the module.


vadimkantorov commented Jun 18, 2025

Oh nice :) It might be good to demonstrate a practical recipe with S3 checkpointing in torchtitan.

Also, a typical issue with checkpointing is the need to prune/delete existing checkpoints to save space. Does torchtitan provide any built-in policies to prune existing checkpoints? E.g. keep sparse regular checkpoints + the K best checkpoints + several of the latest checkpoints (and all of this must interface with S3, as keeping many local checkpoints of large models is not feasible...)
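As a sketch of the kind of retention policy being asked about (a hypothetical helper, not a torchtitan or DCP API; the actual deletion would go through the filesystem or S3 DeleteObject):

```python
def select_checkpoints_to_keep(steps, losses, every_nth=100, keep_latest=2, keep_best=1):
    """Hypothetical retention policy: keep sparse regular checkpoints,
    the K best by loss, and the newest M; everything else is eligible
    for deletion."""
    keep = {s for s in steps if s % every_nth == 0}   # sparse regulars
    keep.update(sorted(steps)[-keep_latest:])          # newest M
    by_loss = sorted(steps, key=lambda s: losses[s])
    keep.update(by_loss[:keep_best])                   # K best
    return sorted(keep)

steps = [50, 100, 150, 200, 250]
losses = {50: 3.0, 100: 2.5, 150: 2.7, 200: 2.1, 250: 2.2}
keep = select_checkpoints_to_keep(steps, losses)  # → [100, 200, 250]
```

Here step 100 survives as a regular checkpoint, 200 as both latest and best-loss, and 250 as latest; 50 and 150 would be pruned.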

facebook-github-bot pushed a commit that referenced this pull request Jun 18, 2025
Summary:

This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287

Differential Revision: D72391401

@teja-rao left a comment

I added a few comments, but I think I have a fundamental question: should we support zero-overhead copy in the AsyncStager implementation instead of in the legacy staging path?


_iterate_state_dict(
ret = []
for idx, v in enumerate(iter_object):
obj = _iterate_state_dict(

why change this? if it is not needed, can we revert it?

Comment on lines +148 to +163
obj = _iterate_state_dict(
    value,
    sharded_tensor_func,
    dtensor_func,
    tensor_func,
    pg=pg,
    device=device,
    cpu_offload=cpu_offload,
    companion_obj=(
        companion_obj[key] if companion_obj is not None else None
    ),
    ranks_only=ranks_only,
    type_check=type_check,
    non_blocking=non_blocking,
)
ret[key] = obj

nit: ret[key] = _iterate_state_dict(...)

)


class _ThreadBasedAsyncCheckpointExecutor(_AsyncCheckpointExecutor):

Unrelated to this PR, but we should deprecate this in favor of async_process_executor.

from torch.distributed.checkpoint.metadata import Metadata
from torch.distributed.checkpoint.planner import SavePlan, SavePlanner
from torch.distributed.checkpoint.staging import AsyncStager
from torch.distributed.checkpoint.staging import AsyncStager

repeated, remove?

executor: _AsyncCheckpointExecutor = (
def stage_state_dict() -> Future[STATE_DICT_TYPE]:
staging_executor = ThreadPoolExecutor(max_workers=1)
if isinstance(storage_writer, AsyncStager) and not use_default_staging:

Suggested change
    if isinstance(storage_writer, AsyncStager) and not use_default_staging:
with:
    if storage_writer is not None and isinstance(storage_writer, AsyncStager):

Comment on lines +308 to +310
use_default_staging = False
if storage_writer is None:
    use_default_staging = True

remove? see suggestion on L321


if isinstance(storage_writer, AsyncStager) and not use_default_staging:
    staging_future = staging_executor.submit(storage_writer.stage, state_dict)
else:
    # provides bwc for storage_writers not implementing AsyncStager

Do we need to handle this case? Can we ask users to implement AsyncStager if they need zero-copy? I think that is simpler to support and cleaner from an API point of view.

if not block_on_staging:
    global _CACHED_STATE_DICT
    if not _CACHED_STATE_DICT:
        _CACHED_STATE_DICT = _create_cpu_state_dict(
            state_dict, pin_memory=True, share_memory=True
        )

pin_memory and share_memory need to be controllable options so the user has a choice to disable them, as they come with drawbacks that might not work for every model or every system.
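A sketch of what user-controllable staging options could look like (names assumed for illustration; this is not the actual DCP StagingOptions):

```python
from dataclasses import dataclass

@dataclass
class StagingOptions:
    """Illustrative knobs letting users opt out of pinned or shared
    staging memory: pinning physical RAM can fail or hurt on some
    systems, and shared memory is bounded by /dev/shm size limits."""
    use_pinned_memory: bool = True
    use_shared_memory: bool = True

def create_staging_buffer(nbytes: int, opts: StagingOptions) -> dict:
    # A real implementation would dispatch to pinned/shared allocators;
    # here we just record which path was chosen.
    return {
        "nbytes": nbytes,
        "pinned": opts.use_pinned_memory,
        "shared": opts.use_shared_memory,
    }

buf = create_staging_buffer(1024, StagingOptions(use_pinned_memory=False))
```

Defaulting both flags to True keeps the fast path, while still giving users an escape hatch for constrained systems, which is the concern raised here.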


@teja-rao left a comment

Overall, the PR looks good to me after the updates. I will stamp the PR once testing is complete and CI shows green. Thank you for making the changes.


Should this be STATE_DICT_TYPE | Future[STATE_DICT_TYPE]?

nit: I see we are converting the state_dict to a Future to pass in, which is okay, but I think it is more readable to just keep passing in the union of STATE_DICT_TYPE and Future.

@Saiteja64 Jun 20, 2025

That makes sense. I was originally thinking that, in the long term, stage would only return a Future (which would work for sync as well), but I don't see that happening because it would introduce breaking changes and would be hard to cleanly deprecate the old return type.

Comment on lines 299 to 298

nit:

Suggested change
    if async_stager is None:
        if (storage_writer is None or not isinstance(storage_writer, AsyncStager)):
            async_stager = DefaultStager(StagingOptions(not block_on_staging, not block_on_staging, not block_on_staging, not block_on_staging))
        elif isinstance(storage_writer, AsyncStager):
            # bwc with old storage_writers
            async_stager = storage_writer
with:
    if async_stager is None:
        if (storage_writer is not None and isinstance(storage_writer, AsyncStager)):
            # bwc with old storage_writers
            async_stager = storage_writer
        else:
            async_stager = DefaultStager(StagingOptions(not block_on_staging, not block_on_staging, not block_on_staging, not block_on_staging))


Is the save method still used for sync save? Why not change it to support a union?

@teja-rao Jun 26, 2025

nit: What do you think about this? We could eliminate save_wrapper and add the isinstance check in the save method itself.
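A sketch of the suggested shape, with the isinstance check folded into save() itself (toy code illustrating the control flow, not the DCP signature):

```python
from concurrent.futures import Future

def save(state_dict):
    """Illustrative version of the suggestion: instead of routing staged
    saves through a separate save_wrapper, save() unwraps a staging
    Future itself before writing."""
    if isinstance(state_dict, Future):
        state_dict = state_dict.result()  # wait for staging to finish
    return {"written": sorted(state_dict)}

# Works both for a plain state dict and a staged (Future) one.
fut = Future()
fut.set_result({"weight": 1})
direct = save({"bias": 0})
staged = save(fut)
```

Collapsing the wrapper this way keeps one public entry point for both sync and staged inputs, which is what the union-type discussion above is circling around.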


@teja-rao left a comment

Sending back for doc updates and for consideration of the nits.


Is this assert needed? mypy typechecks should catch it if you aren't returning a Future.


Without this we introduce a linter error, because async_save now returns either a tuple of staging_future/upload_future or just an upload future.

Comment on lines 25 to 26
@teja-rao Jun 26, 2025

Clean up/update CheckpointStager? I think these are from the DCP evolution work.

Comment on lines 149 to 154

I think we do not want users to create a stager each time; the stager caches the storages. Maybe this needs an update.


Throw an exception and suggest synchronizing via the future, or call staging_future.result() here?
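A sketch of the fail-fast behavior being suggested (illustrative helper, not DCP code): raise with actionable guidance when the previous staging has not finished, rather than silently blocking or proceeding.

```python
from concurrent.futures import Future

def check_staging_complete(staging_future: Future) -> None:
    """Illustrative guard: fail fast with a hint to synchronize when the
    previous checkpoint is still staging."""
    if not staging_future.done():
        raise RuntimeError(
            "Previous checkpoint is still staging; call "
            "staging_future.result() (or otherwise wait on it) "
            "before issuing the next save."
        )

fut = Future()
try:
    check_staging_complete(fut)
    raised = False
except RuntimeError:
    raised = True

fut.set_result(None)
check_staging_complete(fut)  # no exception once staging is done
```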

@teja-rao left a comment

Approving to unblock; please fix the mypy error before landing.


Summary:
X-link: meta-pytorch/tnt#1010


This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan:
Test with TorchTitan on this PR: pytorch/torchtitan#1287

Add new UT

Reviewed By: diego-urgell

Differential Revision: D72391401

@facebook-github-bot

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)


pytorch-bot bot commented Jun 28, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Saiteja64

@pytorchbot merge


pytorch-bot bot commented Jun 29, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Saiteja64

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Labels: ciflow/trunk, fb-exported, Merged, oncall: distributed, release notes: distributed (checkpoint)
