This repository was archived by the owner on May 31, 2025. It is now read-only.

Indigo devel fix rospy reconnect #851

Closed

atiderko wants to merge 4 commits into ros:indigo-devel from fkie-forks:indigo-devel-fix-rospy-reconnect

Conversation

@atiderko
Contributor

  1. While roscpp reconnects on timeouts or temporary 'No route to host'
     errors, rospy does not. So on connection problems lasting longer than 3
     minutes, the connection between subscriber and publisher is lost.
     I added some exceptions; if these occur, rospy no longer closes the
     socket and reconnects instead.
  2. Since rospy now reconnects on timeout, the reconnect attempts come very
     fast, at a rate where it is easier to restart the ROS node than to wait
     for a reconnect. So I added a maximal backoff time.
  3. On connection problems during node start, the subscription to a
     publisher was given up after 3 tries. Now the reconnection attempts stop
     only on shutdown of the node.

atiderko added 3 commits July 29, 2016 11:18
@dirk-thomas
Member

dirk-thomas commented Aug 8, 2016

Some comments for the different changes:

First commit: Can you please elaborate on which exceptions you expect and in which cases you want to continue retrying? Currently it looks like it retries if the exception is not a socket.error?

Second commit: An upper limit for the exponential backoff is a good idea. Why does the patch use two different limits instead of the same one?

Third commit: the current retry logic (3 times) is implemented in _connect_topic. This patch adds another level of retry on top of that in _connect_topic_thread. What is the rationale for implementing the new behavior in a different location rather than modifying the existing retry logic?

@atiderko
Contributor Author

atiderko commented Aug 12, 2016

First commit:

  1. socket.timeout: on timeouts caused by delays on wireless links
  2. ENETDOWN (100), ENETUNREACH (101), ENETRESET (102), ECONNABORTED (103): while using ROS_HOSTNAME, ROS binds to a specific interface. These errors are thrown on interface shutdown, e.g. on reconnection in LTE networks
  3. ETIMEDOUT (110): same as 1 (for completeness)
  4. EHOSTDOWN (112), EHOSTUNREACH (113): while the network and/or DNS server is not reachable

Perhaps it would be easier to reconnect on every socket.error, but there are also some errors on which a reconnection makes no sense. Since there is no error message in user space when the connection is aborted, it would also be useful to add a logerr output when the connection is closed.

Second commit:
I used the same values as are used for the timeout in the connect call (some lines above).

Third commit:
The existing _connect_topic blocks for 3 minutes in the worst case. The new reconnect blocks, in the worst case, until the ROS node is stopped. Therefore I added the new reconnect in _connect_topic_thread, because this method is called in a thread, as the method name suggests. Since it is nowhere documented that _connect_topic should only be called from a thread, I did not want to add a never-ending block there. I would prefer to remove the 3 retries from _connect_topic, but that can also be done in an additional commit.

@dirk-thomas
Member

Thank you for the explanation. Let me try to rephrase and clarify my questions and concerns:

First commit:

  • Please add the rationale for the error codes in a comment in the code so that future readers understand why they were added.

  • In the case of a timeout wouldn't it make sense to try to reconnect again?

  • I don't see a reason why it should retry if a different exception than socket.error is raised. I would expect the opposite:

    if not isinstance(e, socket.error):
        # FATAL: no reconnection as error is unknown
        self.close()
    elif not isinstance(e, socket.timeout) and e.errno not in [100, 101, 102, 103, 110, 112, 113]:
        # in this case the error is well known
    

Second commit:

  • I was referring to the two different thresholds within the patch: 60s vs 30s
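For illustration, the capped exponential backoff under discussion can be sketched as a standalone helper. This is hypothetical code, not the literal patch; which cap value to use (30 s vs. 60 s) is exactly the open question here.

```python
def next_backoff(interval, cap=30.0):
    """Double the wait between reconnect attempts, bounded by a fixed cap."""
    return min(interval * 2.0, cap)

# growth of the retry interval starting from 0.5 s
intervals = []
interval = 0.5
for _ in range(8):
    interval = next_backoff(interval)
    intervals.append(interval)
print(intervals)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

Without the cap, the doubling would quickly reach waits of several minutes, which is the behavior the second commit avoids.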

Third commit:
Since _connect_topic is an internal function it shouldn't be used from the outside. I think it should be fine to modify the existing logic to do the "unlimited" retrying.

@atiderko
Contributor Author

atiderko commented Aug 12, 2016

OK, I see my mistake in the first commit now ;-(

The connection should be closed in three cases:

  1. NOT a socket.error, or
  2. NOT a socket.timeout, and
  3. NOT one of the specific errno values

My pull request should be changed to the following (plus the rationale for the error codes in a comment):

if not isinstance(e, socket.error):
    # FATAL: no reconnection as error is unknown
    self.close()
elif not isinstance(e, socket.timeout) and e.errno not in [100, 101, 102, 103, 110, 112, 113]:
    # in this case the error is well known
    self.close()

Second commit:
The 30 s are because of `timeout=30.` and the 60 s because of `timeout=60.`. No clue why the timeouts are different.

Third commit:
I agree

I am on holiday for the next three weeks and not able to change the pull request. Perhaps it is easier if you make the needed changes addressed by this pull request yourself, and then close this request.

regards

* added/fixed the description of the cases in which to continue retrying
* set the upper limit for the exponential backoff to 32 sec
* moved the retry logic from `_connect_topic_thread` to the existing
  place in `_connect_topic`, replacing the previous logic of 3 retries
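The reworked logic of the last commit (retry until the node shuts down, with the backoff capped at 32 sec) can be sketched roughly as follows. This is a simplified illustration, not the rospy code: `try_connect` and `is_shutdown` are hypothetical stand-ins for the XML-RPC connect call and rospy's shutdown check.

```python
import time

def connect_with_retry(try_connect, is_shutdown, sleep=time.sleep,
                       max_interval=32.0):
    """Retry until the connect succeeds or the node shuts down,
    doubling the wait between attempts up to max_interval seconds."""
    interval = 0.5
    while not is_shutdown():
        if try_connect():
            return True
        sleep(interval)
        interval = min(interval * 2.0, max_interval)
    return False  # node shut down before a connection was made
```

A quick simulation, with a connect callback that succeeds on the third attempt and a sleep callback that only records the waits, shows waits of 0.5 s and 1.0 s before success.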
@atiderko
Contributor Author

Hi @dirk-thomas,

I applied the changes we discussed.

@atiderko
Contributor Author

atiderko commented Sep 2, 2016

Something went wrong: I only changed Python code, but rostest.runner.RosTest.testpubsub_n_fast_udp for the C++ part was not successful. Or am I misunderstanding the test results?

@dirk-thomas
Member

It could be a flaky test. Let me retrigger the jobs: @ros-pull-request-builder retest this please

@dirk-thomas
Member

The patch looks good to me. But I would merge it into the kinetic-devel branch instead.

@ros/ros_team Can you please review this PR.

Member

@wjwwood wjwwood left a comment


Other than some small stylistic or comment-formatting related comments, lgtm.

I agree that this should be targeted at kinetic. Either only kinetic, or at least kinetic first with an option to backport to Indigo if things go well and there is a desire for it.

We should solicit testers for this change too, as I could easily imagine it changing behavior in a breaking way for complex systems, or even affecting something like the ctrl-c behavior of simple scripts.

while not success:
tries += 1
interval = 0.5 # seconds
# try to get the topic information until succes or the ROS node is shutdown.
Member


"success" or probably "until successful".

Member


I will address this on cherry-pick.

tries += 1
interval = 0.5 # seconds
# try to get the topic information until succes or the ROS node is shutdown.
# -> on connections problems while ROS node start
Member


What does -> mean in this case, return or exit or something else? I'd prefer an explicit comment, like "while ROS node is running try to connect, exit on connection problems" (not sure if that statement is accurate, but it is clearer than the current comment).

Member


I updated it to:

while the ROS node is not shutdown try to get the topic information
and retry on connections problems after some wait


if self.socket is None:
# exponential backoff
if self.socket is None and interval < 30.:
Member


nitpick: I know that 30. is still technically correct to force it as a float literal, but 30.0 is the style used elsewhere, so I'd appreciate updating this for consistency.

Member


Existing code just a few lines above uses `30.`, so I think it's fine as is.

Member


Member


I don't mind either way. The ROS 1 code base has pretty bad style, and I don't see the way a float is written as important enough to spend more time on it.

Member


Ok, fine by me.

@dirk-thomas
Member

Thank you for the patch and for iterating on it. I have cherry-picked the patch to the kinetic-devel branch: 14b5c93
