
[BUG] Not exiting gracefully on failures from training script for multi-node runs #1995

@jerrymannil

Description

Describe the bug
With multi-node runs using the PDSH launcher, when there is a failure in the training script, the child training processes don't exit properly. This causes the ssh sessions to wait indefinitely instead of returning control to the pdsh command.

The issue is caused by the use of Popen.kill() here.
The KILL signal ends the training processes abruptly, which can leave their child processes as zombies that never report the termination back to the parent (launcher) process. The ssh session then keeps waiting on the zombie process and never returns.
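For context, a common way to avoid this is to send SIGTERM to the whole process group first and escalate to SIGKILL only after a timeout, so children get a chance to clean up and be reaped. A minimal sketch, not the launcher's actual code (the function name, timeout, and the assumption that the process was started in its own session are illustrative):

```python
import os
import signal
import subprocess

def terminate_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Ask the training process (and its children) to exit; escalate only if needed."""
    try:
        # Signal the whole process group so grandchildren are not orphaned.
        # Assumes the process was launched with start_new_session=True.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        return  # already gone
    try:
        proc.wait(timeout=timeout)  # give it a chance to shut down cleanly
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # last resort
        proc.wait()  # reap the process so no zombie is left behind
```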

To Reproduce

  1. Start a multi-node job
  2. Raise an exception from the training script
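A minimal training script that triggers the failure could look like the following (the file name and the point at which the exception is raised are arbitrary; any unhandled exception partway through training reproduces the hang):

```python
# repro_fail.py -- hypothetical minimal script run under the DeepSpeed launcher
def main():
    # ... normal setup / a few training steps would go here ...
    raise RuntimeError("simulated training failure")  # triggers the failure path

if __name__ == "__main__":
    main()
```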

Expected behavior

  1. The child training processes are terminated and the parent (launcher) process exits properly
  2. The ssh sessions return control to the pdsh session on the main node, and the deepspeed run exits.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.10.0a0+git36449ea
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.5.6+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: two machines with 8x A100s each
  • Python version: 3.8.10

Launcher context
Default DeepSpeed launcher (PDSH)
