Closed
Labels: bug (Something isn't working)
Description
Describe the bug
With multi-node runs using the PDSH launcher, when the training script fails, the child training processes don't exit properly. This causes the SSH sessions to wait indefinitely and never return control to the pdsh command.
The issue is caused by the use of Popen.kill() here.
The KILL signal ends the training processes abruptly. This may leave their child processes as zombies that never properly report the kill to the parent (launcher) process, so the SSH session continues to wait on the zombie process.
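A minimal sketch of the safer shutdown pattern described above (this is not DeepSpeed's actual launcher code; the script is a stand-in): instead of calling Popen.kill() on the parent process alone, signal the whole process group with SIGTERM, wait to reap the process so no zombie is left, and escalate to SIGKILL only on timeout.

```python
import os
import signal
import subprocess
import sys

def shutdown(p, timeout=30.0):
    """Terminate a process and its children via the process group, then reap it."""
    pgid = os.getpgid(p.pid)
    os.killpg(pgid, signal.SIGTERM)     # give children a chance to exit cleanly
    try:
        p.wait(timeout=timeout)         # reap so no zombie is left behind
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # escalate only if SIGTERM was ignored
        p.wait()

# Stand-in for a training process: sleeps until it receives a signal.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(600)"],
    start_new_session=True,             # put the child in its own process group
)
shutdown(proc, timeout=10.0)
print(proc.returncode)                  # negative value = terminated by signal
```

Because the child runs in its own session/process group, `os.killpg` reaches any grandchildren too, and the explicit `wait()` ensures the launcher observes the exit status rather than leaving a zombie behind.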
To Reproduce
- Start a multi-node job
- Raise an exception from the training script
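The second step can be reproduced with a training script that raises partway through a run; a hedged sketch (names are illustrative, not from the actual reproduction script):

```python
# Simulate a training loop that fails mid-run, the way a real error in the
# training script would surface and trigger the launcher's kill path.
def train_step(step):
    if step == 3:
        raise RuntimeError("simulated training failure")
    return step

completed = []
try:
    for step in range(10):
        completed.append(train_step(step))
except RuntimeError as err:
    print(f"training aborted after {len(completed)} steps: {err}")
```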
Expected behavior
- The child training processes are terminated and the parent (launcher) process exits properly
- The SSH sessions return control to the pdsh process on the main node and the deepspeed run exits
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.10.0a0+git36449ea
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.5.6+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: two machines with x8 A100s each
- Python version: 3.8.10
Launcher context
Default DeepSpeed launcher (PDSH)