This is similar to the issue in mmseg; I will just quote it here:
The _parse_losses function of mmseg.models.segmentors.base.BaseSegmentor attempts to synchronize the loss values across all GPUs. What happens is that in this loop (line 194):
for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()
Suppose one GPU, A, does not have "roi_acc" as a loss_name (and suppose it is the last key in log_vars). GPU A then thinks it has done all its work and jumps out of the loop, while the other GPUs, which do have the final "roi_acc" key, call torch.distributed.all_reduce and wait indefinitely for a reply from GPU A.
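For illustration, here is a hypothetical standalone script (not taken from the issue) that reproduces the pattern with two CPU processes and the gloo backend; when run, rank 1 never returns from its second all_reduce, which is exactly the symptom described:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # gloo so the example runs on CPU; the original hang was observed on GPUs
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    log_vars = {'loss_cls': torch.tensor(1.0)}
    if rank == 1:
        # only rank 1 produces this extra entry
        log_vars['roi_acc'] = torch.tensor(0.5)

    for loss_name, loss_value in log_vars.items():
        # rank 0 finishes after one iteration; rank 1 blocks forever on its
        # second all_reduce, waiting for a matching call from rank 0
        dist.all_reduce(loss_value.div_(world_size))
        print(f'rank {rank} reduced {loss_name}')

    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)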
This bug is hard to debug, because the program simply blocks without any error message.
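Since the hang produces no traceback, one possible workaround, sketched below under my own assumptions rather than as the upstream fix, is to fail fast before entering the loop by checking that every rank is about to issue the same number of collective calls. The helper name assert_same_number_of_log_vars and its device argument are hypothetical, and the check compares only key counts, not key names:

import torch
import torch.distributed as dist

def assert_same_number_of_log_vars(log_vars, device):
    # Fail fast if ranks disagree on how many loss entries they will reduce.
    # Assumes `device` is this rank's GPU when the NCCL backend is used.
    if not (dist.is_available() and dist.is_initialized()):
        return
    # Exactly one collective call per rank, so this check itself cannot hang.
    count = torch.tensor(len(log_vars), dtype=torch.long, device=device)
    total = count.clone()
    dist.all_reduce(total)
    if total.item() != len(log_vars) * dist.get_world_size():
        raise RuntimeError(
            f'rank {dist.get_rank()} has loss keys {sorted(log_vars)}, but the '
            'total key count across ranks does not match; the per-key '
            'all_reduce loop would deadlock.')

Calling something like this right before the for loop would turn the silent hang into an immediate error on the ranks whose key count disagrees with the rest.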