Issue description
The timeout setting in the function init_process_group in DDP does not work.
I found that since release v1.3.0, the TCPStore initialization function connects to the server with a timeout, but the timeout value is only set after the TCPStore initialization function has returned.
init_process_group -> TCPStore()-> set timeout
This means the first connect will always use the default timeout (300 seconds).
In v1.2.0, this call stack was fine, since the first connect during initialization was made with no timeout:
// Connect to the daemon
storeSocket_ = tcputil::connect(tcpStoreAddr_, tcpStorePort_);
Since v1.3.0, the first connect is:
storeSocket_ = tcputil::connect(tcpStoreAddr_, tcpStorePort_, /* wait= */ true, timeout_);
which results in the bug.
My suggestion is to connect to the server without a timeout for the first call, as in v1.2.0.
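To make the ordering problem concrete, here is a minimal pure-Python sketch of the call order described above. It does not use PyTorch or real sockets: MockStore, mock_init_process_group, and the connect_with_timeout flag are hypothetical names invented for illustration, and the 300-second default is the value quoted in this issue. It only records which timeout the first connect would see.

```python
from datetime import timedelta

# Default store timeout quoted in this issue (assumption: 300 seconds).
DEFAULT_TIMEOUT = timedelta(seconds=300)


class MockStore:
    """Hypothetical stand-in for TCPStore; records timeouts, opens no sockets."""

    def __init__(self, connect_with_timeout=True):
        self.timeout = DEFAULT_TIMEOUT
        # The first connect happens inside the constructor, before any
        # caller has had a chance to change self.timeout.
        # None models the v1.2.0-style connect with no timeout.
        self.first_connect_timeout = self.timeout if connect_with_timeout else None

    def set_timeout(self, timeout):
        # Only affects operations issued after construction.
        self.timeout = timeout


def mock_init_process_group(timeout, connect_with_timeout=True):
    """Mirrors the order in this issue: construct the store, then set the timeout."""
    store = MockStore(connect_with_timeout)
    store.set_timeout(timeout)
    return store


user_timeout = timedelta(seconds=10)

# v1.3.0-style ordering: the first connect ignores the user's timeout.
buggy = mock_init_process_group(user_timeout)
print(buggy.first_connect_timeout.total_seconds())  # 300.0

# Suggested behaviour: the first connect is made without a timeout.
suggested = mock_init_process_group(user_timeout, connect_with_timeout=False)
print(suggested.first_connect_timeout)  # None
print(suggested.timeout.total_seconds())  # 10.0
```

In the first case the user-supplied 10 seconds arrives too late to influence the initial connect; in the second, the initial connect is not bounded by the default at all, and later operations still use the user's value.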
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar