Support async checkpointing through CheckpointManager #5697

Merged
jonb377 merged 4 commits into master from jonbolin/async-chkpt
Oct 13, 2023
Collaborator

@jonb377 jonb377 commented Oct 10, 2023

Support asynchronous checkpointing through the CheckpointManager interface. The state_dict is moved to CPU before the checkpoint starts, so the calling thread is unblocked while the checkpoint is written.

This PR depends on #5693 for synchronous checkpointing functionality.
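
To illustrate the pattern described above, here is a minimal sketch using only the Python standard library. It is not the code from this PR: `AsyncCheckpointer`, `save_async`, `join`, `close`, and `write_fn` are hypothetical names, and `copy.deepcopy` stands in for moving the state_dict to CPU. The idea is that the caller only pays for the snapshot, while a background thread performs the write.

```python
import copy
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical helper for illustration; not the torch_xla CheckpointManager."""

    def __init__(self, write_fn):
        # write_fn(step, state_dict) performs the (slow) checkpoint write.
        self._write_fn = write_fn
        self._queue = queue.Queue()
        # Daemon thread so a forgotten checkpointer does not block interpreter exit.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save_async(self, step, state_dict):
        # Assumption: deepcopy stands in for the device-to-CPU transfer.
        # After this snapshot the caller may keep mutating state_dict.
        host_copy = copy.deepcopy(state_dict)
        self._queue.put((step, host_copy))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                step, host_copy = item
                self._write_fn(step, host_copy)
            finally:
                self._queue.task_done()

    def join(self):
        # Block until every queued checkpoint has been written.
        self._queue.join()

    def close(self):
        # Ask the worker to exit and wait for it.
        self._queue.put(None)
        self._worker.join()
```

A caller would invoke `save_async(step, state_dict)` inside the training loop and call `join()` before relying on the checkpoint being on disk. With real tensors, the snapshot step would be a device-to-host copy rather than `deepcopy`.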

Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
Comment thread torch_xla/experimental/distributed_checkpoint/manager.py Outdated
Comment thread test/spmd/test_xla_distributed_checkpoint.py Outdated
Comment thread torch_xla/experimental/distributed_checkpoint/manager.py
Contributor

@yeounoh yeounoh left a comment

LGTM, left some questions.

@jonb377 jonb377 force-pushed the jonbolin/async-chkpt branch 3 times, most recently from b9e1952 to 273ef2f October 12, 2023 02:12
Base automatically changed from jonbolin/chkpt-manager to master October 13, 2023 00:14
@jonb377 jonb377 force-pushed the jonbolin/async-chkpt branch from ad02d56 to fdecad3 October 13, 2023 00:21
Collaborator

@alanwaketan alanwaketan left a comment

LGTM.

@jonb377 jonb377 merged commit 06ba6e9 into master Oct 13, 2023
@jonb377 jonb377 deleted the jonbolin/async-chkpt branch October 13, 2023 20:21
zpcore pushed a commit that referenced this pull request Oct 19, 2023
* Support async checkpointing through CheckpointManager

* Allow threads to exit when CheckpointManager is freed

* Use rank from tracked process group

* Add TODO
ghpvnist pushed a commit to ghpvnist/pytorch-xla that referenced this pull request Oct 31, 2023
Collaborator Author

jonb377 commented Nov 10, 2023

cc @wz337

mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
3 participants