Support synchronous saving and loading in CheckpointManager #5693
Merged
jonb377
commented
Oct 10, 2023
from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE

# TODO(jonbolin): Import path will change
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter
Collaborator
Author
The import path will change once the API becomes public upstream. @alanwaketan @yeounoh do you have any thoughts on how to handle this?
Collaborator
It's okay. The change will break our CI upstream, and then we can land a companion change to fix it.
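For readers weighing the alternative, one way to tolerate a moving import path is to try several candidate module paths in order. This is only a sketch and not what the thread settled on (the reviewers chose to let CI break and follow up with a companion change). `import_first_available` is a hypothetical helper, and the demo uses standard-library names because the eventual public path is not stated here.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Return the first module in `candidates` that imports successfully.

    Hypothetical helper for illustration; it is not part of torch or torch_xla.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate module could be imported: " +
                      "; ".join(errors))


# Demo with stand-in names: the first candidate does not exist, so the
# helper falls back to the second one.
mod = import_first_available(["no_such_public_module_yet", "json"])
```

In the PR's setting, the candidate list would hold the future public module path first, followed by `torch.distributed.checkpoint._fsspec_filesystem`. The trade-off is that a silent fallback hides the upstream change, whereas a CI break makes it visible.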
alanwaketan
reviewed
Oct 11, 2023
yeounoh
reviewed
Oct 11, 2023
alanwaketan
reviewed
Oct 11, 2023
152c9bf to
4430877
yeounoh
reviewed
Oct 12, 2023
Collaborator
Author
Thanks @yeounoh and @alanwaketan for the review! I'll merge after TPU CI.
zpcore
pushed a commit
that referenced
this pull request
Oct 19, 2023
* Support synchronous saving and loading in CheckpointManager
* Use 0 to indicate no upper bound
* Don't track async_queue_size
* Cache tracked steps locally
* Track creation time in metadata
* Rename save_period to save_interval
* Fix tests
ghpvnist
pushed a commit
to ghpvnist/pytorch-xla
that referenced
this pull request
Oct 31, 2023
mbzomowski
pushed a commit
to mbzomowski-test-org/xla
that referenced
this pull request
Nov 16, 2023
chunnienc
pushed a commit
to chunnienc/xla
that referenced
this pull request
Dec 14, 2023
golechwierowicz
pushed a commit
that referenced
this pull request
Jan 12, 2024
bhavya01
pushed a commit
that referenced
this pull request
Apr 22, 2024
This PR adds the initial functionality for CheckpointManager: taking checkpoints synchronously, restoring them, and limiting how many checkpoints it tracks.
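To make the described behavior concrete, here is a toy, single-process sketch of a manager that saves synchronously, restores by step, and bounds the number of tracked checkpoints. It is not the torch_xla implementation: the class name `SketchCheckpointManager`, its method names, the on-disk layout, and the use of `pickle` are all assumptions for illustration. It borrows only ideas named in the commit message (`save_interval`, 0 meaning no upper bound, locally cached tracked steps, creation time in metadata), and it evicts by lowest step number as a simplification.

```python
import json
import os
import pickle
import shutil
import tempfile
import time


class SketchCheckpointManager:
    """Toy stand-in for illustration; not the torch_xla CheckpointManager."""

    def __init__(self, path, save_interval, max_to_keep=0):
        if save_interval <= 0:
            raise ValueError("save_interval must be positive")
        if max_to_keep < 0:
            raise ValueError("max_to_keep must be non-negative")
        self.path = path
        self.save_interval = save_interval
        self.max_to_keep = max_to_keep  # 0 means no upper bound
        os.makedirs(path, exist_ok=True)
        # Cache tracked steps locally rather than re-listing on every call.
        self._tracked = sorted(
            int(d) for d in os.listdir(path) if d.isdigit())

    def _step_dir(self, step):
        return os.path.join(self.path, str(step))

    def should_save(self, step):
        return step % self.save_interval == 0

    def save(self, step, state, force=False):
        """Write `state` synchronously; return True if a checkpoint was taken."""
        if not (force or self.should_save(step)):
            return False
        step_dir = self._step_dir(step)
        os.makedirs(step_dir, exist_ok=True)
        with open(os.path.join(step_dir, "state.pkl"), "wb") as f:
            pickle.dump(state, f)
        # Record the creation time alongside the checkpoint.
        with open(os.path.join(step_dir, "metadata.json"), "w") as f:
            json.dump({"step": step, "created": time.time()}, f)
        if step not in self._tracked:
            self._tracked.append(step)
            self._tracked.sort()
        self._evict()
        return True

    def _evict(self):
        if self.max_to_keep == 0:
            return
        while len(self._tracked) > self.max_to_keep:
            oldest = self._tracked.pop(0)
            shutil.rmtree(self._step_dir(oldest))

    def restore(self, step):
        if step not in self._tracked:
            raise KeyError(f"no checkpoint tracked for step {step}")
        with open(os.path.join(self._step_dir(step), "state.pkl"), "rb") as f:
            return pickle.load(f)

    def all_steps(self):
        return list(self._tracked)


# Usage: checkpoint every 10 steps and keep only the 2 most recent.
with tempfile.TemporaryDirectory() as tmp:
    manager = SketchCheckpointManager(tmp, save_interval=10, max_to_keep=2)
    for step in range(0, 50):
        manager.save(step, {"step": step})
    kept = manager.all_steps()
    restored = manager.restore(kept[-1])
```

With these settings, steps 0 through 40 are saved in turn and older ones are evicted, so only steps 30 and 40 remain tracked at the end.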