destroy_process_group() has a certain probability of hangs #75097


🐛 Describe the bug

Recently, I have often been running into a hang when calling destroy_process_group().

Code snippet:

dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")
train(args)
if WORLD_SIZE > 1 and RANK == 0:
    LOGGER.info(f'rank: {RANK}, destroying process group... ')
    dist.destroy_process_group()
    LOGGER.info(f'rank: {RANK}, destroy process group finished')

The log is as follows:

rank: 0, destroying process group...

When I run the nvidia-smi command, the output is as follows (note that, due to a Docker configuration issue, the Python process IDs are not shown, but the processes do exist):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   32C    P0    34W / 250W |   1409MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I can also see some Python processes using ps aux.
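
For reference, the teardown pattern I compared against is sketched below. This is only a minimal illustration, assuming destroy_process_group() is meant to be called from every rank (after a barrier) rather than only from rank 0; teardown_distributed is a hypothetical helper name, not something from my actual script:

import torch.distributed as dist

def teardown_distributed():
    # Hypothetical helper: only tear down if a process group was created.
    if dist.is_initialized():
        # Synchronize so no rank exits while others still have collectives in flight.
        dist.barrier()
        # Every rank (not just rank 0) destroys the process group.
        dist.destroy_process_group()

if __name__ == "__main__":
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")
    # ... training loop would run here, i.e. train(args) in the snippet above ...
    teardown_distributed()

I am not sure whether calling destroy_process_group() only on rank 0 is what triggers the hang; I am just noting the pattern I was comparing against.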

Versions

PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.8 (default, Apr 13 2021, 19:58:26)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.mt20190308.130.el7.x86_64-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
GPU 2: Tesla V100-PCIE-16GB
GPU 3: Tesla V100-PCIE-16GB

Nvidia driver version: 450.51.06
cuDNN version: /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py38h7f8727e_0
[conda] mkl_fft                   1.3.1            py38hd3c417c_0
[conda] mkl_random                1.2.2            py38h51133e4_0
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] pytorch                   1.8.0           py3.8_cuda10.2_cudnn7.6.5_0    pytorch
[conda] torchaudio                0.8.0                      py38    pytorch
[conda] torchvision               0.9.0                py38_cu102    pytorch

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

Labels: oncall: distributed
