@aj-prime commented on Jul 28, 2022

This PR adds basic elastic training support to DeepSpeed using Elastic Agent in PyTorch.

Requirement:
PyTorch >= 1.11.x

Modifications:

  1. Added runtime options for elastic training to the DeepSpeed launcher.
  2. Implemented a new elastic training agent for DeepSpeed, based on the PyTorch Elastic Agent.
  3. Requires the user to save and load the latest checkpoint.
  4. Added Elasticity v0.2, which relaxes one batch size constraint.
  5. Tested the Megatron and HelloDeepSpeed models on a DeepSpeed cluster.

Related Repos:

Megatron: https://github.com/microsoft/Megatron-DeepSpeed/tree/arpan/elastic_training
DeepSpeedExamples: https://github.com/microsoft/DeepSpeedExamples-internal/tree/arpan/elastic_scripts

Instructions to launch elastic training:
Megatron Model: https://github.com/microsoft/Megatron-DeepSpeed/blob/arpan/elastic_training/examples/MoE/README_elastic.md
HelloDeepSpeed: https://github.com/microsoft/DeepSpeedExamples-internal/edit/arpan/elastic_scripts/elastic_training/README.md

Log:
We start training on allocation 1 (2 nodes) and drop one node after the 40th iteration.

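| Iterations | World Size | Nodes in Training (Allocation 1) | Nodes in Training (Allocation 2) |
|------------|------------|----------------------------------|----------------------------------|
| 0-40       | 8          | 2                                | 0                                |
| 40-60      | 4          | 1                                | 0                                |
| 60-80      | 12         | 1                                | 2                                |
| 80-100     | 8          | 1                                | 1                                |
| 100-140    | 12         | 2                                | 1                                |
| 140-160    | 16         | 2                                | 2                                |
| 160-180    | 12         | 2                                | 1                                |
| 180-200    | 8          | 1                                | 1                                |
| 200-       | 4          | 1                                | 0                                |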
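
The world sizes in the log are consistent with every node contributing 4 GPUs. That figure is inferred from the numbers and is not stated in the PR. A short sketch that checks each row under that assumption:

```python
# Assumption: 4 GPUs per node, inferred from the log rather than stated in the PR.
GPUS_PER_NODE = 4

# (iterations, world size, nodes in allocation 1, nodes in allocation 2)
LOG = [
    ("0-40", 8, 2, 0),
    ("40-60", 4, 1, 0),
    ("60-80", 12, 1, 2),
    ("80-100", 8, 1, 1),
    ("100-140", 12, 2, 1),
    ("140-160", 16, 2, 2),
    ("160-180", 12, 2, 1),
    ("180-200", 8, 1, 1),
    ("200-", 4, 1, 0),
]


def expected_world_size(nodes_alloc1: int, nodes_alloc2: int,
                        gpus_per_node: int = GPUS_PER_NODE) -> int:
    """World size is the total node count times the GPUs on each node."""
    return (nodes_alloc1 + nodes_alloc2) * gpus_per_node


# Collect any row whose reported world size disagrees with the node counts.
mismatches = [row for row in LOG if expected_world_size(row[2], row[3]) != row[1]]
print(mismatches)  # [] (every row matches)
```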

Workflow

This PR is divided into two modules that can be enabled separately or together.

Runner module, which relaunches training when a scale-up or scale-down event occurs
[Figure: workflow_runner_module (png)]

Valid batch size computation in elastic training
[Figure: workflow_elasticity_v2 (png)]
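
To illustrate what a "valid" batch size means here: the global batch size has to equal micro-batch size × gradient accumulation steps × number of GPUs, with a whole number of accumulation steps. The sketch below is a simplified illustration of that divisibility check, not the algorithm implemented in this PR. The function name `valid_gpu_counts` and the example values (batch size 96, micro-batch size 2) are assumptions chosen so that the world sizes seen in the log (4, 8, 12, 16) all come out valid.

```python
from typing import List


def valid_gpu_counts(batch_size: int, micro_batches: List[int],
                     min_gpus: int, max_gpus: int) -> List[int]:
    """Return the GPU counts that can run `batch_size` without changing it.

    A GPU count is valid if some allowed micro-batch size yields a whole
    number of gradient accumulation steps:
        batch_size == micro_batch * accumulation_steps * gpus
    """
    valid = []
    for gpus in range(min_gpus, max_gpus + 1):
        if any(batch_size % (mb * gpus) == 0 for mb in micro_batches):
            valid.append(gpus)
    return valid


print(valid_gpu_counts(96, [2], 1, 16))  # [1, 2, 3, 4, 6, 8, 12, 16]
```

Because the global batch size stays fixed across relaunches, a scale event changes only the number of accumulation steps per GPU, so the optimizer sees the same effective batch before and after.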

@awan-10 merged commit 63ae1c5 into staging-ft-elastic-v1 on Jul 29, 2022
jeffra added a commit that referenced this pull request Jul 29, 2022
Co-authored-by: Arpan Jain <t-arpanjain@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
@mrwyattii deleted the arpan/elasticity branch on July 7, 2023 02:39