This repository was archived by the owner on May 31, 2025. It is now read-only.

Indigo devel fix rospy reconnect #851

Closed

atiderko wants to merge 4 commits into ros:indigo-devel from fkie-forks:indigo-devel-fix-rospy-reconnect

Conversation

@atiderko
Contributor

  1. While roscpp reconnects on timeouts or temporary 'No route to host'
     errors, rospy does not. So on connection problems lasting longer than 3
     minutes, the connection between subscriber and publisher is lost.
     I added some exceptions; if these occur, rospy no longer closes the
     socket and reconnects instead.
  2. Since rospy now reconnects on timeout, the reconnect attempts come very
     fast, at a rate where it is easier to restart the ROS node than to wait
     for a reconnect. So I added a maximal backoff time.
  3. On connection problems during node start, the subscription to a
     publisher was given up after 3 tries. Now the reconnection attempts stop
     only on shutdown of the node.

atiderko added 3 commits July 29, 2016 11:18
@dirk-thomas
Member

dirk-thomas commented Aug 8, 2016

Some comments for the different changes:

First commit: Can you please elaborate on which exceptions you expect and in which cases you want to continue retrying? Currently it looks like it retries if the exception is not a socket.error?

Second commit: An upper limit for the exponential backoff is a good idea. Why does the patch use two different limits instead of the same one?

Third commit: the current retry logic (3 times) is implemented in _connect_topic. This patch adds another level of retry on top of that in _connect_topic_thread. What is the rationale for implementing the new behavior in a different location rather than modifying the existing retry logic?

@atiderko
Contributor Author

atiderko commented Aug 12, 2016

First commit:

  1. socket.timeout: on timeouts caused by delays on wireless links
  2. ENETDOWN (100), ENETUNREACH (101), ENETRESET (102), ECONNABORTED (103): while using ROS_HOSTNAME, ROS binds to a specific interface. These errors are thrown on interface shutdown, e.g. on reconnection in LTE networks
  3. ETIMEDOUT (110): same as 1 (for completeness)
  4. EHOSTDOWN (112), EHOSTUNREACH (113): while the network and/or DNS server is not reachable

Perhaps it would be easier to reconnect on every socket.error, but there are also some errors on which a reconnection makes no sense. Since there is no error message in user space when the connection is aborted, it would also be useful to add a logerr output when the connection is closed.

Second commit:
I used the same values as are used for the timeout in the connect call (some lines above).

Third commit:
The existing _connect_topic blocks for 3 minutes in the worst case. The new reconnect blocks, in the worst case, until the ROS node is stopped. Therefore I added the new reconnect in _connect_topic_thread, because this method is called in a thread, as the method name suggests. Since it is nowhere documented that _connect_topic should only be called from a thread, I did not want to add a never-ending block there. I would prefer to remove the 3 retries from _connect_topic, but that can also be done in an additional commit.

@dirk-thomas
Member

Thank you for the explanation. Let me try to rephrase and clarify my questions and concerns:

First commit:

  • Please add the rationale for the error codes in a comment in the code so that future readers understand why they were added.

  • In the case of a timeout wouldn't it make sense to try to reconnect again?

  • I don't see a reason why it should retry if a different exception than socket.error is raised. I would expect the opposite:

    if not isinstance(e, socket.error):
        # FATAL: no reconnection as error is unknown
        self.close()
    elif not isinstance(e, socket.timeout) and e.errno not in [100, 101, 102, 103, 110, 112, 113]:
        # in this case the error is well known
    

Second commit:

  • I was referring to the two different thresholds within the patch: 60s vs 30s
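For illustration, the capped exponential backoff under discussion can be sketched as a standalone helper. This is hypothetical code, not the literal patch; which cap value to use (30 s vs. 60 s) is exactly the open question here.

```python
def next_backoff(interval, cap=30.0):
    """Double the wait between reconnect attempts, bounded by a fixed cap."""
    return min(interval * 2.0, cap)

# growth of the retry interval starting from 0.5 s
intervals = []
interval = 0.5
for _ in range(8):
    interval = next_backoff(interval)
    intervals.append(interval)
print(intervals)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

Without the cap, the doubling would quickly reach waits of several minutes, which is the behavior the second commit avoids.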

Third commit:
Since _connect_topic is an internal function it shouldn't be used from the outside. I think it should be fine to modify the existing logic to do the "unlimited" retrying.

@atiderko
Contributor Author

atiderko commented Aug 12, 2016

OK, I see my mistake in the first commit now ;-(

The connection should be closed in three cases:

  1. NOT a socket.error, or
  2. NOT a socket.timeout, and
  3. NOT one of the specific errno values

My pull request should be changed to the following (plus the rationale for the error codes in a comment):

if not isinstance(e, socket.error):
    # FATAL: no reconnection as error is unknown
    self.close()
elif not isinstance(e, socket.timeout) and e.errno not in [100, 101, 102, 103, 110, 112, 113]:
    # in this case the error is well known
    self.close()

Second commit:
The 30 s are because of `timeout=30.` and the 60 s because of `timeout=60.`. No clue why the timeouts are different.

Third commit:
I agree

I am on holiday for the next three weeks and not able to change the pull request. Perhaps it is easier if you make the needed changes addressed by this pull request yourself, and then close this request.

regards

* added/fixed the description of the cases in which to continue retrying
* set the upper limit for the exponential backoff to 32 sec
* moved the retry logic from `_connect_topic_thread` to the existing
  place in `_connect_topic`, replacing the previous logic of 3 retries
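The reworked logic of the last commit (retry until the node shuts down, with the backoff capped at 32 sec) can be sketched roughly as follows. This is a simplified illustration, not the rospy code: `try_connect` and `is_shutdown` are hypothetical stand-ins for the XML-RPC connect call and rospy's shutdown check.

```python
import time

def connect_with_retry(try_connect, is_shutdown, sleep=time.sleep,
                       max_interval=32.0):
    """Retry until the connect succeeds or the node shuts down,
    doubling the wait between attempts up to max_interval seconds."""
    interval = 0.5
    while not is_shutdown():
        if try_connect():
            return True
        sleep(interval)
        interval = min(interval * 2.0, max_interval)
    return False  # node shut down before a connection was made
```

A quick simulation, with a connect callback that succeeds on the third attempt and a sleep callback that only records the waits, shows waits of 0.5 s and 1.0 s before success.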
@atiderko
Contributor Author

Hi @dirk-thomas,

I applied the changes we discussed.

@atiderko
Contributor Author

atiderko commented Sep 2, 2016

Something went wrong: I only changed Python code, but rostest.runner.RosTest.testpubsub_n_fast_udp for the C++ part was not successful. Or am I misunderstanding the test results?

@dirk-thomas
Member

It could be a flaky test. Let me retrigger the jobs: @ros-pull-request-builder retest this please

@dirk-thomas
Member

The patch looks good to me. But I would merge it into the kinetic-devel branch instead.

@ros/ros_team Can you please review this PR.

Member

@wjwwood wjwwood left a comment


Other than some small stylistic or comment-formatting related comments, lgtm.

I agree that this should be targeted at kinetic. Either only kinetic, or at least kinetic first with an option to backport to Indigo if things go well and there is a desire for it.

We should solicit testers for this change too, as I could easily imagine it changing behavior in a breaking way for complex systems, or even affecting something like the ctrl-c behavior of simple scripts.

while not success:
tries += 1
interval = 0.5 # seconds
# try to get the topic information until succes or the ROS node is shutdown.
Member


"success" or probably "until successful".

Member


I will address this on cherry-pick.

tries += 1
interval = 0.5 # seconds
# try to get the topic information until succes or the ROS node is shutdown.
# -> on connections problems while ROS node start
Member


What does -> mean in this case, return or exit or something else? I'd prefer an explicit comment, like "while ROS node is running try to connect, exit on connection problems" (not sure if that statement is accurate, but it is clearer than the current comment).

Member


I updated it to:

while the ROS node is not shutdown try to get the topic information
and retry on connections problems after some wait


if self.socket is None:
# exponential backoff
if self.socket is None and interval < 30.:
Member


nitpick: I know that 30. is still technically correct to force it as a float literal, but 30.0 is the style used elsewhere, so I'd appreciate updating this for consistency.

Member


Existing code just a few lines above uses `30.`, so I think it's fine as is.

Member


Member


I don't mind either way. The ROS 1 code base has pretty bad style, and I don't see the way a float is written as important enough to spend more time on it.

Member


Ok, fine by me.

@dirk-thomas
Member

Thank you for the patch and for iterating on it. I have cherry-picked the patch to the kinetic-devel branch: 14b5c93
