Skip to content
This repository was archived by the owner on May 31, 2025. It is now read-only.

Abort topic lookup on connection refused#1044

Merged
dirk-thomas merged 1 commit intoros:lunar-develfrom
clearpathrobotics:ga_abort_on_connect_refused
May 12, 2017
Merged

Abort topic lookup on connection refused#1044
dirk-thomas merged 1 commit intoros:lunar-develfrom
clearpathrobotics:ga_abort_on_connect_refused

Conversation

@guillaumeautran
Copy link
Copy Markdown
Contributor

In a multimaster environment where a topic has multiple publishers,
when a node drops out abruptly (host is shutdown), a single subscriber update on
that topic will cause multiple threads to be created (one for each host) in order to
resolve the topic location. This cause a thread leak as host which are turned off
will not respond and when they come back online, the xmlrpc URI is changed causing a
connection refused error at the socket layer.

This fix catches the connection refused error and terminate the thread with the understanding
that if the connection is refused, the rosnode cannot be reached now or never. This effectively
prevents thread leak.

Note: if the remote host where the rosnode is thought to be never comes back up,
then the thread will still be leaked as the exception received is a host unreachable type.
This is intentional to avoid abruptly terminating the thread in case of a temporary DNS failure.

if getattr(e, 'errno', None) == errno.ECONNREFUSED:
code = errno.ECONNREFUSED
msg = str(e)
break
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return down below only triggers when code <= 0, but you're assigning it a positive value. What is the expected behaviour here, when we break out of the while loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the intention is to return at the early return. But the errno will be positive. Let me fix that.

In a multimaster environment where a topic has multiple publishers,
when a node drops out abruptly (host is shutdown), a single subscriber update on
that topic will cause multiple threads to be created (one for each host) in order to
resolve the topic location. This cause a thread leak as host which are turned off
will not respond and when they come back online, the xmlrpc URI is changed causing a
connection refused error at the socket layer.

This fix catches the connection refused error and terminate the thread with the understanding
that if the connection is refused, the rosnode cannot be reached now or never. This effectively
prevents thread leak.

Note: if the remote host where the rosnode is thought to be never comes back up,
then the thread will still be leaked as the exception received is a host unreachable type.
This is intentional to avoid abruptly terminating the thread in case of a temporary DNS failure.
@guillaumeautran guillaumeautran force-pushed the ga_abort_on_connect_refused branch from e41901f to a61b63c Compare May 2, 2017 15:31
@guillaumeautran
Copy link
Copy Markdown
Contributor Author

Ping for review...

@mikepurvis
Copy link
Copy Markdown
Member

@dirk-thomas We've been running this on our bots for the week; would like to get it merged if there's no objection.

@dirk-thomas
Copy link
Copy Markdown
Member

Sounds good to me. Thank you both for your effort on this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants