
Connect timeout feature do not work in DDP with TCPStore #32924

@boren-ms


Issue description

The timeout setting in init_process_group does not work for DDP.
Since release v1.3.0, the TCPStore initialization function connects to the server with a timeout, but the user-supplied timeout value is only set after the TCPStore initialization function has returned:
init_process_group -> TCPStore() -> set timeout
This means the first connect always uses the default timeout (300 seconds).
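
To make the ordering problem concrete, here is a minimal sketch in plain Python. `ToyStore` and `toy_init_process_group` are hypothetical stand-ins invented for illustration; they are not the real TCPStore or init_process_group and open no sockets. The sketch only assumes the call order described above and the 300-second default.

```python
from datetime import timedelta

# Default timeout cited in this issue.
DEFAULT_TIMEOUT = timedelta(seconds=300)


class ToyStore:
    """Hypothetical stand-in for TCPStore (illustration only)."""

    def __init__(self):
        self.timeout = DEFAULT_TIMEOUT
        # The first connect happens inside the constructor, so it can
        # only ever see the default timeout.
        self.first_connect_timeout = self._connect()

    def _connect(self):
        # A real store would open a socket here; the toy just records
        # the timeout that would have been applied.
        return self.timeout

    def set_timeout(self, timeout):
        self.timeout = timeout


def toy_init_process_group(timeout):
    """Mimics the order: construct the store, then set the timeout."""
    store = ToyStore()          # first connect happens here
    store.set_timeout(timeout)  # too late to affect the first connect
    return store


store = toy_init_process_group(timedelta(seconds=10))
print(store.first_connect_timeout.total_seconds())  # 300.0
print(store.timeout.total_seconds())                # 10.0
```

The user asked for 10 seconds, yet the connect made during construction still ran with the 300-second default.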

In v1.2.0 this call order was fine, because the first connect during initialization was made with no timeout:
// Connect to the daemon
storeSocket_ = tcputil::connect(tcpStoreAddr_, tcpStorePort_);

Since v1.3.0, the first connect is:
storeSocket_ = tcputil::connect(tcpStoreAddr_, tcpStorePort_, /* wait= */ true, timeout_);
which results in the bug.

My suggestion is to connect to the server without a timeout for the first call, as in v1.2.0.
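
As a rough sketch of what "connect without a timeout" could look like, the Python below retries a connect attempt until it succeeds and only enforces a deadline when one is given. `connect_with_retry` and `FlakyServer` are hypothetical names made up for this example, and it assumes that waiting means retrying while the server refuses connections; the real tcputil::connect may behave differently.

```python
import time


def connect_with_retry(try_connect, timeout=None, interval=0.01):
    """Retry try_connect() until it succeeds.

    timeout=None means wait indefinitely (the suggested behaviour for
    the first connect); otherwise give up after `timeout` seconds.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            return try_connect()
        except ConnectionRefusedError:
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError("connect timed out")
            time.sleep(interval)


class FlakyServer:
    """Hypothetical server that refuses the first `fail_times` attempts."""

    def __init__(self, fail_times):
        self.fail_times = fail_times
        self.attempts = 0

    def accept(self):
        self.attempts += 1
        if self.attempts <= self.fail_times:
            raise ConnectionRefusedError("server not up yet")
        return "connected"


# No timeout: the client keeps retrying until the server is up.
server = FlakyServer(fail_times=3)
result = connect_with_retry(server.accept, timeout=None)
print(result, server.attempts)
```

With `timeout=None` the client never gives up early, which matches the v1.2.0 behaviour described above; passing a finite `timeout` restores a deadline for callers that want one.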

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

Labels: oncall: distributed, triaged
