
Conversation

trajepl (Contributor) commented Jul 11, 2022

No description provided.

trajepl (Contributor, Author) commented Jul 11, 2022

> It feels to me like the NebulaCheckpointEngine should be part of the Nebula library itself. Is it possible to make the Nebula APIs align more closely with the torch save/load APIs? Then DeepSpeed, or any other Nebula user, would require fewer code changes to support this feature.

Currently, it is hard to make the Nebula save/load API exactly the same as torch save/load, because Nebula relies on a tag to build up the concept of a checkpoint and manage the saved files.
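To make the tag point concrete, here is a minimal Python sketch (not the code from this PR) contrasting plain torch.save/torch.load with a tag-based engine; the class and method names are assumptions for illustration only:

```python
import torch

# Plain torch: every file stands alone; nothing groups the per-rank
# files of one training step into a single logical checkpoint.
#   torch.save(state_dict, "mp_rank_00_model_states.pt")
#   state_dict = torch.load("mp_rank_00_model_states.pt")

# Tag-based sketch (hypothetical): files saved between create(tag) and
# commit(tag) belong to one logical checkpoint, which lets the backend
# version, validate, and garbage-collect them as a unit.
class TagBasedEngineSketch:
    def create(self, tag):
        # Open a logical checkpoint, e.g. tag="global_step1000".
        self.current_tag = tag

    def save(self, state_dict, path):
        # Register this rank's file under the current tag.
        torch.save(state_dict, path)

    def load(self, path, map_location=None):
        return torch.load(path, map_location=map_location)

    def commit(self, tag):
        # Declare the checkpoint complete only after all ranks have saved,
        # so a partially written checkpoint is never loaded.
        return True
```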

> Instead of modifying the core deepspeed.initialize() API, can we instead add Nebula config options to the DeepSpeed config JSON? We try very hard not to add additional params to our initialize API. In the near future we are going to simplify these args even further.

Updated in latest commit.
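For illustration, this is roughly what configuring Nebula through the DeepSpeed config could look like instead of a new initialize() argument; the "nebula" section and its keys below are assumptions for the example, not necessarily the exact names introduced by this PR:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)

# Hypothetical config: the "nebula" block and its keys are illustrative.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "nebula": {
        "enabled": True,
        "persistent_storage_path": "/data/nebula_checkpoints",
        "persistent_time_interval": 100,
    },
}

# initialize() keeps its existing signature; the engine picks the
# checkpoint backend up from the config rather than a new parameter.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```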

> This warning will be printed on all ranks for anyone who doesn't have torch_nebula installed (e.g., 256 GPUs -> 256 prints). Can we ignore this import warning and raise an explicit error if and only if a user attempts to enable Nebula?

Updated in latest commit.
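A minimal sketch of the deferred-import pattern being asked for (torch_nebula is the module named above; the surrounding class name is an assumption): the import happens only when a user enables Nebula, so ranks without the package see no warning unless they actually request the feature.

```python
class NebulaEngineSketch:
    """Illustrative only: import torch_nebula lazily instead of at module load."""

    def __init__(self, config):
        try:
            # Imported only when Nebula is enabled, so nothing is printed
            # on ranks (or clusters) that never request the feature.
            import torch_nebula  # noqa: F401
        except ImportError as err:
            raise RuntimeError(
                "Nebula checkpointing was enabled, but the torch_nebula "
                "package is not installed."
            ) from err
        self.config = config
```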

tjruwase merged commit e669aaf into deepspeedai:master on Jul 28, 2022
