@adammoody adammoody commented Sep 20, 2022

This PR integrates the Scalable Checkpoint / Restart (SCR) library into the DeepSpeed checkpoint path. The major changes are:

  • add start / complete calls to declare the start and end of each checkpoint phase,
  • avoid creating any checkpoint directories when using SCR (SCR will create those when needed),
  • rely on SCR to record and report the tag name for a restart,
  • register each checkpoint file and acquire a temporary path to use when writing the file.
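
For reviewers unfamiliar with the flow, the four changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeSCR`, `save_checkpoint`, `latest_tag`, and the `rank_{rank}.pt` file name are hypothetical, and `FakeSCR` is an in-memory stand-in whose method names only loosely mirror the SCR C API (`SCR_Start_output`, `SCR_Route_file`, `SCR_Complete_output`, `SCR_Have_restart`).

```python
import os
import tempfile


class FakeSCR:
    """Hypothetical stand-in for SCR; NOT the real library bindings."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.current = None  # tag of the checkpoint currently being written
        self.latest = None   # tag of the most recent valid checkpoint

    def start_output(self, name):
        # Declare the start of a checkpoint phase.
        self.current = name

    def route_file(self, path):
        # Register a file and return a temporary path to write to.
        # The stand-in (like SCR) creates directories; the caller does not.
        tmp = os.path.join(self.cache_dir, self.current, os.path.basename(path))
        os.makedirs(os.path.dirname(tmp), exist_ok=True)
        return tmp

    def complete_output(self, valid):
        # Declare the end of the phase; only a valid checkpoint is recorded.
        if valid:
            self.latest = self.current
        self.current = None

    def have_restart(self):
        # Report the tag to restart from, or None if there is none.
        return self.latest


def save_checkpoint(scr, save_dir, tag, rank, payload):
    """Write one rank's checkpoint file through the (assumed) SCR flow."""
    scr.start_output(tag)
    # Note: no os.makedirs(os.path.join(save_dir, tag)) here.
    requested = os.path.join(save_dir, tag, f"rank_{rank}.pt")
    actual = scr.route_file(requested)
    valid = True
    try:
        with open(actual, "wb") as f:
            f.write(payload)
    except OSError:
        valid = False
    scr.complete_output(valid)
    return actual


def latest_tag(scr):
    """Ask the library for the restart tag instead of reading a 'latest' file."""
    return scr.have_restart()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache:
        scr = FakeSCR(cache)
        save_dir = os.path.join(cache, "final")
        written = save_checkpoint(scr, save_dir, "global_step100", 0, b"weights")
        print(written, latest_tag(scr))
```

The point of the sketch is the ordering: the caller brackets each write with start / complete calls, writes to whatever path the library hands back, and never creates the checkpoint directory or tracks the tag itself.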

@adammoody adammoody changed the base branch from layerckpt to master October 26, 2022 16:44
@adammoody adammoody changed the base branch from master to layerckpt October 26, 2022 16:45
@adammoody adammoody changed the base branch from layerckpt to master October 26, 2022 16:45