
Conversation

trajepl (Contributor) commented Jul 11, 2022

No description provided.

trajepl (Contributor, Author) commented Jul 11, 2022

> It feels to me like the NebulaCheckpointEngine should be part of the Nebula library itself. Is it possible to make the Nebula APIs align more closely with the torch save/load APIs? Then DeepSpeed, or any other Nebula user, would require fewer code changes to support this feature.

Currently, it is hard to make the Nebula save/load API exactly the same as torch save/load, because Nebula relies on a tag to build up the concept of a checkpoint and manage the saved files.
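To make the tag point concrete, here is a minimal Python sketch (not the code from this PR) contrasting plain torch.save/torch.load with a tag-based engine; the class and method names are assumptions for illustration only:

```python
import torch

# Plain torch: every file stands alone; nothing groups the per-rank
# files of one training step into a single logical checkpoint.
#   torch.save(state_dict, "mp_rank_00_model_states.pt")
#   state_dict = torch.load("mp_rank_00_model_states.pt")

# Tag-based sketch (hypothetical): files saved between create(tag) and
# commit(tag) belong to one logical checkpoint, which lets the backend
# version, validate, and garbage-collect them as a unit.
class TagBasedEngineSketch:
    def create(self, tag):
        # Open a logical checkpoint, e.g. tag="global_step1000".
        self.current_tag = tag

    def save(self, state_dict, path):
        # Register this rank's file under the current tag.
        torch.save(state_dict, path)

    def load(self, path, map_location=None):
        return torch.load(path, map_location=map_location)

    def commit(self, tag):
        # Declare the checkpoint complete only after all ranks have saved,
        # so a partially written checkpoint is never loaded.
        return True
```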

> Instead of modifying the core deepspeed.initialize() API, can we instead add Nebula config options to the DeepSpeed config JSON? We try very hard not to add additional params to our initialize API. In the near future we are going to simplify these args even further.

Updated in latest commit.
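For illustration, this is roughly what configuring Nebula through the DeepSpeed config could look like instead of a new initialize() argument; the "nebula" section and its keys below are assumptions for the example, not necessarily the exact names introduced by this PR:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)

# Hypothetical config: the "nebula" block and its keys are illustrative.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "nebula": {
        "enabled": True,
        "persistent_storage_path": "/data/nebula_checkpoints",
        "persistent_time_interval": 100,
    },
}

# initialize() keeps its existing signature; the engine picks the
# checkpoint backend up from the config rather than a new parameter.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```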

> This warning will be printed on all ranks for anyone who doesn't have torch_nebula installed (e.g., 256 GPUs -> 256 prints). Can we ignore this import warning and raise an explicit error if and only if a user attempts to enable Nebula?

Updated in latest commit.
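A minimal sketch of the deferred-import pattern being asked for (torch_nebula is the module named above; the surrounding class name is an assumption): the import happens only when a user enables Nebula, so ranks without the package see no warning unless they actually request the feature.

```python
class NebulaEngineSketch:
    """Illustrative only: import torch_nebula lazily instead of at module load."""

    def __init__(self, config):
        try:
            # Imported only when Nebula is enabled, so nothing is printed
            # on ranks (or clusters) that never request the feature.
            import torch_nebula  # noqa: F401
        except ImportError as err:
            raise RuntimeError(
                "Nebula checkpointing was enabled, but the torch_nebula "
                "package is not installed."
            ) from err
        self.config = config
```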

tjruwase merged commit e669aaf into deepspeedai:master on Jul 28, 2022
