destroy_process_group() has a certain probability of hanging #75097
🐛 Describe the bug
Recently, I often run into a problem where the program hangs when calling destroy_process_group().
Code snippet:

dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")
train(args)
if WORLD_SIZE > 1 and RANK == 0:
    LOGGER.info(f'rank: {RANK}, destroying process group... ')
    dist.destroy_process_group()
    LOGGER.info(f'rank: {RANK}, destroy process group finished')

The log is as follows:
rank: 0, destroying process group...
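For comparison, here is a sketch of the teardown with every rank calling destroy_process_group() after a barrier, instead of tearing down only on rank 0 (same dist, train, args, WORLD_SIZE, RANK, and LOGGER assumed as in the snippet above; I have not yet confirmed whether this avoids the hang):

dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")
train(args)
if WORLD_SIZE > 1:
    # wait until every rank has finished its collectives before tearing down
    dist.barrier()
    LOGGER.info(f'rank: {RANK}, destroying process group...')
    # every rank destroys its own process group, not just rank 0
    dist.destroy_process_group()
    LOGGER.info(f'rank: {RANK}, destroy process group finished')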
When I run the nvidia-smi command, the output is as follows (note that due to a Docker configuration issue the Python process IDs are not displayed, but the processes actually exist):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:04:00.0 Off | 0 |
| N/A 32C P0 34W / 250W | 1409MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:05:00.0 Off | 0 |
| N/A 27C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... On | 00000000:08:00.0 Off | 0 |
| N/A 28C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... On | 00000000:09:00.0 Off | 0 |
| N/A 28C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I can also still see the Python processes with ps aux.
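To find out where the stuck process is blocked, one option (a sketch only, standard library, not part of the original run) is to register faulthandler near the top of the training script, so a hung process can be asked to dump its Python stacks with kill -USR1 <pid>:

import faulthandler
import signal

# On SIGUSR1, print the Python traceback of every thread to stderr,
# which shows whether the process is blocked inside destroy_process_group().
faulthandler.register(signal.SIGUSR1, all_threads=True)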
Versions
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.mt20190308.130.el7.x86_64-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
GPU 2: Tesla V100-PCIE-16GB
GPU 3: Tesla V100-PCIE-16GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.19.5 pypi_0 pypi
[conda] pytorch 1.8.0 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.8.0 py38 pytorch
[conda] torchvision 0.9.0 py38_cu102 pytorch
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang