Skip to content

Support synchronous saving and loading in CheckpointManager#5693

Merged
jonb377 merged 7 commits intomasterfrom
jonbolin/chkpt-manager
Oct 13, 2023
Merged

Support synchronous saving and loading in CheckpointManager#5693
jonb377 merged 7 commits intomasterfrom
jonbolin/chkpt-manager

Conversation

@jonb377
Copy link
Copy Markdown
Collaborator

@jonb377 jonb377 commented Oct 10, 2023

This PR adds the initial functionality for CheckpointManager to manage synchronously taking checkpoints, restoring checkpoints, and managing how many checkpoints it tracks.

@jonb377 jonb377 added the distributed SPMD and other distributed things. label Oct 10, 2023
@jonb377 jonb377 self-assigned this Oct 10, 2023
from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE

# TODO(jonbolin): Import path will change
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import path will change when the API becomes public in the upstream. @alanwaketan @yeounoh do you have any thoughts on how to handle this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay. The upstream test will break our CI in the upstream, and then we can have a companion change to fix it.

Comment thread test/spmd/test_xla_distributed_checkpoint.py
from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE

# TODO(jonbolin): Import path will change
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay. The upstream test will break our CI in the upstream, and then we can have a companion change to fix it.

Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
Comment thread test/spmd/test_xla_distributed_checkpoint.py Outdated
Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
@jonb377 jonb377 force-pushed the jonbolin/chkpt-manager branch from 152c9bf to 4430877 Compare October 12, 2023 00:37
Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
Copy link
Copy Markdown
Contributor

@yeounoh yeounoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@alanwaketan alanwaketan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jonb377
Copy link
Copy Markdown
Collaborator Author

jonb377 commented Oct 12, 2023

Thanks @yeounoh and @alanwaketan for the review! I'll merge after TPU CI.

@jonb377 jonb377 merged commit 2a45d0d into master Oct 13, 2023
@jonb377 jonb377 deleted the jonbolin/chkpt-manager branch October 13, 2023 00:14
zpcore pushed a commit that referenced this pull request Oct 19, 2023
* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
ghpvnist pushed a commit to ghpvnist/pytorch-xla that referenced this pull request Oct 31, 2023
…5693)

* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
…5693)

* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
…5693)

* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

distributed SPMD and other distributed things.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants