Use fixed MASTER_PORT in test_distributed#13109
pietern wants to merge 1 commit into pytorch:master
Summary: The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes pytorch#12876.
Differential Revision: D10850614
fbshipit-source-id: 9af1ffb44150063c1f0cbd2bf3462a13b4d55fc1
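The race described in the summary can be reproduced in a few lines. This is only a sketch of the general probe-then-rebind pattern, not the code that was in test_distributed: the names `find_free_port` and `demonstrate_race` are hypothetical, and the "intruder" socket stands in for an unrelated process that grabs the port during the gap.

```python
import errno
import socket


def find_free_port():
    # Bind to port 0 so the OS picks a free port, then release it immediately.
    # The port is only guaranteed to be free while this socket holds it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def demonstrate_race():
    port = find_free_port()

    # Stand-in for an unrelated process that takes the port in the gap
    # between the probe socket closing and rank 0 binding.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    intruder.bind(("127.0.0.1", port))
    intruder.listen(1)

    # Stand-in for rank 0 binding to the port it was told was free.
    rank0 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        rank0.bind(("127.0.0.1", port))
        return False
    except OSError as exc:
        # "Address already in use"
        return exc.errno == errno.EADDRINUSE
    finally:
        rank0.close()
        intruder.close()
```

In a real test run the gap is much shorter than in this forced example, which is why the failure shows up only intermittently.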
This will probably work better, provided we always manage to reliably shut down the servers after every test, but it's not the 100% right approach. Might be good enough. I think the actually "right" way to do this is to spawn the server and have it automatically assign itself a port, then have it communicate the port to the parent process somehow, so that you can send to it. Sometimes this is inconvenient to do, in which case another robust way is to keep spawning the server configured with different ports until it succeeds.
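A minimal sketch of the first suggestion, assuming the server can be started as a child process that reports its port on stdout. The names `SERVER_CODE`, `start_server`, and `fetch_greeting` are hypothetical and unrelated to how the distributed tests launch their ranks. The key point is that the server binds to port 0 itself and never releases the socket, so there is no window in which another process can take the port.

```python
import socket
import subprocess
import sys

# Child process: bind to port 0, keep the socket open, report the port.
SERVER_CODE = r"""
import socket
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
print(srv.getsockname()[1], flush=True)
conn, _ = srv.accept()
conn.sendall(b"hello")
conn.close()
srv.close()
"""


def start_server():
    proc = subprocess.Popen(
        [sys.executable, "-c", SERVER_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # The first line on stdout is the port the OS assigned to the server.
    port = int(proc.stdout.readline())
    return proc, port


def fetch_greeting():
    proc, port = start_server()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            data = b""
            while True:
                chunk = client.recv(16)
                if not chunk:
                    break
                data += chunk
    finally:
        proc.wait(timeout=5)
        proc.stdout.close()
    return data
```

Calling `fetch_greeting()` starts the child, reads the port it chose, connects, and returns the bytes the server sent. The second suggestion (retrying with different ports) would instead wrap the bind in a loop that catches the "Address already in use" error and moves on to the next candidate.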
@ezyang Agreed. I took a stab at the right approach (exactly what you mention) but it would have involved breaking up how
Summary: Pull Request resolved: pytorch#13109
The "right" strategy of creating a socket, binding to an undefined port, closing the socket, and reusing the port it was bound to, was subject to a race condition. Another process could bind to that same port sooner than the tests would, causing an "Address already in use" failure when rank 0 would try and bind to that same port. The THD tests have been using a fixed port since forever. Time will tell if this fixes pytorch#12876.
Differential Revision: D10850614
fbshipit-source-id: c19f12bb4916141187ee8ddb52880f5f418310dc