Elastic Training support in DeepSpeed #2153

aj-prime · 2022-07-28T21:48:30Z

This PR adds basic elastic training support to DeepSpeed, built on the PyTorch Elastic Agent (torch.distributed.elastic).

Requirement:
PyTorch >= 1.11.x
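
A minimal sketch of how the version requirement above could be checked before enabling elastic training. The helper names `parse_major_minor` and `supports_elastic_training` are hypothetical and are not taken from this PR; the sketch compares numeric tuples because a plain string comparison would rank "1.9" above "1.11".

```python
import re
from typing import Tuple

# Minimum (major, minor) taken from the "Requirement" line above.
MIN_TORCH_VERSION = (1, 11)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.11.0+cu113' or '2.0.1'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_elastic_training(torch_version: str) -> bool:
    """Return True if the given PyTorch version meets the stated requirement."""
    return parse_major_minor(torch_version) >= MIN_TORCH_VERSION
```

In a real script the argument would typically be `torch.__version__`.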

Modifications:

Added runtime options for elastic training to the DeepSpeed launcher.
Implemented a new elastic training agent for DeepSpeed, based on the PyTorch Elastic Agent.
Elastic training requires the user to save and load the latest checkpoint.
Added Elasticity v0.2, which relaxes one batch size constraint.
Tested the Megatron and HelloDeepSpeed models on a DeepSpeed cluster.
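
Because training is relaunched whenever nodes join or leave, the user script has to resume from the latest checkpoint on every start. The following is a rough sketch of that pattern, not code from this PR. It assumes an `engine` object that exposes DeepSpeed-style `load_checkpoint(load_dir)` and `save_checkpoint(save_dir, tag=..., client_state=...)` methods; `resume_and_train`, `train_step`, and the `"step"` key in the client state are hypothetical names chosen for illustration.

```python
def resume_and_train(engine, ckpt_dir, total_steps, save_every, train_step):
    """Resume from the latest checkpoint in ckpt_dir (if any), then train.

    Returns the step index at which training (re)started.
    """
    # With no explicit tag, the engine is assumed to load the most recent
    # checkpoint and to return (None, None) when none exists yet.
    load_path, client_state = engine.load_checkpoint(ckpt_dir)
    start_step = (client_state or {}).get("step", 0) if load_path is not None else 0

    for step in range(start_step, total_steps):
        train_step(engine, step)  # hypothetical per-step callback
        completed = step + 1
        if completed % save_every == 0:
            # Store the step counter so a relaunched job knows where to continue.
            engine.save_checkpoint(
                ckpt_dir,
                tag=f"step{completed}",
                client_state={"step": completed},
            )
    return start_step
```

Saving more often shortens the amount of work repeated after a scale-up or scale-down event, at the cost of more checkpoint I/O.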

Related Repos:

Megatron: https://github.com/microsoft/Megatron-DeepSpeed/tree/arpan/elastic_training
DeepSpeedExamples: https://github.com/microsoft/DeepSpeedExamples-internal/tree/arpan/elastic_scripts

Instructions to launch elastic training:
Megatron Model: https://github.com/microsoft/Megatron-DeepSpeed/blob/arpan/elastic_training/examples/MoE/README_elastic.md
HelloDeepSpeed: https://github.com/microsoft/DeepSpeedExamples-internal/edit/arpan/elastic_scripts/elastic_training/README.md

Log:
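We start training on allocation 1 (2 nodes) and drop one node after the 40th iteration. Later iterations add and remove nodes across both allocations:

| Iterations | World Size | Nodes in Training (Allocation 1) | Nodes in Training (Allocation 2) |
|---|---|---|---|
| 0-40 | 8 | 2 | 0 |
| 40-60 | 4 | 1 | 0 |
| 60-80 | 12 | 1 | 2 |
| 80-100 | 8 | 1 | 1 |
| 100-140 | 12 | 2 | 1 |
| 140-160 | 16 | 2 | 2 |
| 160-180 | 12 | 2 | 1 |
| 180-200 | 8 | 1 | 1 |
| 200 onward | 4 | 1 | 0 |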
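
The PR does not state how many worker processes run on each node, but the log is consistent with a fixed count per node. A small sketch that checks this from the rows of the table above (the row values are copied from the log; `infer_processes_per_node` is a hypothetical helper name):

```python
# (iterations, world_size, nodes_in_allocation_1, nodes_in_allocation_2)
LOG_ROWS = [
    ("0-40", 8, 2, 0),
    ("40-60", 4, 1, 0),
    ("60-80", 12, 1, 2),
    ("80-100", 8, 1, 1),
    ("100-140", 12, 2, 1),
    ("140-160", 16, 2, 2),
    ("160-180", 12, 2, 1),
    ("180-200", 8, 1, 1),
    ("200 onward", 4, 1, 0),
]


def infer_processes_per_node(rows):
    """Return the per-node process count if it is the same in every row."""
    ratios = set()
    for iterations, world_size, alloc1, alloc2 in rows:
        nodes = alloc1 + alloc2
        if nodes == 0 or world_size % nodes != 0:
            raise ValueError(f"inconsistent row for iterations {iterations}")
        ratios.add(world_size // nodes)
    if len(ratios) != 1:
        raise ValueError(f"per-node count varies across rows: {sorted(ratios)}")
    return ratios.pop()


print(infer_processes_per_node(LOG_ROWS))
```

Every row gives the same ratio, so the world size in the log is simply the total node count across both allocations multiplied by that per-node count.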

Workflow

The PR is divided into two modules that can be enabled separately or together:

Runner module, which relaunches training on a scale-up or scale-down event.

Valid batch size computation for elastic training.
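
To illustrate what a "valid batch size" means when the world size changes, here is a simplified sketch. It is not the algorithm implemented in this PR. It only assumes the usual relation train batch size = micro batch size × gradient accumulation steps × world size, and the parameter names `micro_batch_sizes` and `max_train_batch_size` are borrowed from DeepSpeed's elasticity config keys; `valid_batch_config` itself is a hypothetical helper.

```python
def valid_batch_config(world_size, micro_batch_sizes, max_train_batch_size):
    """Pick the largest feasible train batch size for a given world size.

    Returns (train_batch_size, micro_batch_size, grad_accum_steps), or None
    if no micro batch size fits under the maximum. Ties on the train batch
    size are broken in favor of the larger micro batch size.
    """
    best = None
    for micro in micro_batch_sizes:
        per_step = micro * world_size  # samples processed per forward/backward pass
        grad_accum_steps = max_train_batch_size // per_step
        if grad_accum_steps < 1:
            continue  # even a single accumulation step exceeds the maximum
        candidate = (per_step * grad_accum_steps, micro, grad_accum_steps)
        if best is None or candidate > best:
            best = candidate
    return best


# World sizes 4, 8, 12 and 16 are the ones that appear in the log above.
for world_size in (4, 8, 12, 16, 20):
    print(world_size, valid_batch_config(world_size, [2, 4], 96))
```

With these illustrative settings, world sizes 4, 8, 12, and 16 can all keep a train batch size of 96, while a world size of 20 cannot, which is the kind of case where either the world size must be rejected or the batch size must be allowed to vary.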

Co-authored-by: Arpan Jain <t-arpanjain@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Arpan Jain and others added 23 commits June 9, 2022 15:30

proof of concept for elastic training using pytorch

c5ce9d8

Add command line options for elastic training

5635052

Remove functionAgent

9b5e72c

Add NCCL BLOCKING ERROR flag to elastic training

4881ce8

transient change

9f1c997

Added DS elastic agent

183b6bf

Cleanup

4d024f0

pass environment variables to worker processes

1ab3104

Enable elastic checkpoint for scale down in elastic training

f55b767

added detection of master addr and port on rank 0

e37c761

fixed formatting

a41757d

Merge latest master to elasticity branch

95f70ce

add launch.py and elastic_agent.py files to skip list in torchdist check

f637569

add pytorch dependency for elastic training

8aafc3a

add function for checking pytorch version

b0a8802

added kill command for pdsh when SIGINT is received

96d678b

re-enable elastic checkpoint assertion

3ac5119

Merge branch 'staging-ft-elastic-v1' into arpan/elasticity

a23594e

Add support for variable batch size

4f9c535

Fix elasticity V2, enable pipeline parallelism in Elastic Training, and add ENABLE constants to init

d995fb3

updated elastic unit test

706ebce

added an assertion for model-parallel support and added code to protect WORLD_SIZE computation

9063a94

modified elastic training unit test, added config options for elastic training in docs, and added an assertion in runner for elastic training

f4ace71

aj-prime requested review from ShadenSmith, awan-10, cli99, conglongli, jeffra, samyam and tjruwase as code owners July 28, 2022 21:48

aj-prime requested review from RezaYazdaniAminabadi, arashb, duli2012, eltonzheng, minjiaz, mrwyattii, samadejacobs, xiaoxiawu-microsoft and yaozhewei as code owners July 28, 2022 21:48

Arpan Jain added 4 commits July 28, 2022 21:59

resolved conflicts

bb4a7f3

fixed a typo

6ed8066

fixed test_elastic

f2405bd

removed extra imports

66205a1

tjruwase reviewed Jul 29, 2022

deepspeed/launcher/launch.py (outdated, resolved)

tjruwase reviewed Jul 29, 2022

deepspeed/launcher/runner.py (outdated, resolved)

Arpan Jain added 2 commits July 29, 2022 15:31

renamed min and max nodes arguments

7e601b3

use default elastic ID

6de2cd8

jeffra reviewed Jul 29, 2022

deepspeed/launcher/launch.py (outdated, resolved)

expose elastic run id as an env variable

ae26b52

jeffra approved these changes Jul 29, 2022

awan-10 merged commit 63ae1c5 into staging-ft-elastic-v1 Jul 29, 2022

jeffra added a commit that referenced this pull request Jul 29, 2022

Elastic Training support in DeepSpeed (#2153) (#2156)

1ed5aa9

Co-authored-by: Arpan Jain <t-arpanjain@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

mrwyattii deleted the arpan/elasticity branch July 7, 2023 02:39
