
[BUG] Not exiting gracefully on failures from training script for multi-node runs #1995

@jerrymannil

Description

Describe the bug
With multi-node runs using the PDSH launcher, when there is a failure in the training script, the child training processes don't exit properly. This causes the ssh sessions to wait indefinitely instead of returning control to the pdsh command.

The issue is caused by the use of Popen.kill() here.
The KILL signal ends the training processes abruptly, which can leave their child processes as zombies that never report the termination back to the parent (launcher) process. The ssh session then keeps waiting on the zombie process and never returns.
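For context, a common way to avoid this is to send SIGTERM to the whole process group first and escalate to SIGKILL only after a timeout, so children get a chance to clean up and be reaped. A minimal sketch, not the launcher's actual code (the function name, timeout, and the assumption that the process was started in its own session are illustrative):

```python
import os
import signal
import subprocess

def terminate_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Ask the training process (and its children) to exit; escalate only if needed."""
    try:
        # Signal the whole process group so grandchildren are not orphaned.
        # Assumes the process was launched with start_new_session=True.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        return  # already gone
    try:
        proc.wait(timeout=timeout)  # give it a chance to shut down cleanly
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # last resort
        proc.wait()  # reap the process so no zombie is left behind
```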

To Reproduce

  1. Start a multi-node job
  2. Raise an exception from the training script
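A minimal training script that triggers the failure could look like the following (the file name and the point at which the exception is raised are arbitrary; any unhandled exception partway through training reproduces the hang):

```python
# repro_fail.py -- hypothetical minimal script run under the DeepSpeed launcher
def main():
    # ... normal setup / a few training steps would go here ...
    raise RuntimeError("simulated training failure")  # triggers the failure path

if __name__ == "__main__":
    main()
```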

Expected behavior

  1. The child training processes are terminated and the parent (launcher) process exits properly
  2. The ssh sessions return control to the pdsh session on the main node, and the deepspeed run exits.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.10.0a0+git36449ea
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.5.6+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: two machines with 8x A100s each
  • Python version: 3.8.10

Launcher context
Default DeepSpeed launcher (PDSH)
