Allow reuse address when firing up the ros2cli daemon.#947
Allow reuse address when firing up the ros2cli daemon.#947clalancette merged 3 commits intorollingfrom
Conversation
|
Pulls: #947 |
|
After discussing this with @cottsay a bit, we shouldn't do this on Windows; SO_REUSEADDR with multiple ports is basically UB. See https://learn.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse . So I'm going to fix this to only make this change on Linux. It does mean that on Windows we still may have the flaky test, but at least we'll fix it on Linux. We'll have to think about how to do this on Windows separately. |
7a3bb44 to
7cb8b24
Compare
|
And 7cb8b24 fixes it to do just that. |
|
Pulls: #947 |
Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebinded (even after the old process has exited). This can lead to some situations where we fail to spawn the daemon. To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSADDR on the socket. Signed-off-by: Chris Lalancette <clalancette@gmail.com>
7cb8b24 to
e8f06ed
Compare
fujitatomoya
left a comment
There was a problem hiding this comment.
IMO, it would make more sense to enable allow_reuse_address as this PR suggested?
the socket is set with SO_LINGER with timeout 0, that means it causes the socket to send a TCP RST (Reset) upon closure and terminates the connection immediately, discarding any unsent data.
SO_REUSEADDR allows binding to a port that is already in use (e.g., in the TIME_WAIT state). It doesn't terminate the TIME_WAIT state but bypasses the restriction, enabling rebinding.
From what I've read, this is only the case for Windows. The flag works slightly differently on BSD sockets. From man socket(7):
|
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
|
All right, with all of the latest pushes here, I got things to work far more reliably on both Linux and Windows. I'm going to explain a bit about how we got here. What's the real problem we are trying to solve?The original problem here is that running the Why is that test failing?What happens when that test fails is pretty simple; the daemon fails to spawn correctly. Because we have no checking around this in ros2cli/ros2cli/ros2cli/node/daemon.py Lines 104 to 176 in 8766c33 How did we get to where the current code?There have been a few attempts to fix flakiness in the ros2 daemon in the past. These include #620 , #622 , and #652 . These all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation. #622 made some changes to change the sockets to add SO_LINGER with a zero timeout. That improved, but did not totally solve the situation. It also has its own downsides, as SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the socket and terminates it. What's the solution?The solution here is in 3 parts, though one of the parts is platform-dependent. This PR implements all 3 parts:
Finally, while testing here I had to add in one bugfix to make things reliable on Windows, which is to also catch With all of the fixes in this PR, I'm no longer able to reproduce the flakiness in the server on either Linux or Windows. Next up I'm going to run regular CI on this, as well as an equivalent of the nightly repeated jobs to make sure things are happy. |
|
Pulls: #947 |
fujitatomoya
left a comment
There was a problem hiding this comment.
@clalancette good study, thanks for sharing 👍
Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebinded (even after the old process has exited). This can lead to some situations where we fail to spawn the daemon.
To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSADDR on the socket.