rosmaster leaves sockets in CLOSE_WAIT state #610
Description
Possibly related to #495, but not as easy to reproduce.
In some scenarios, likely involving a node (or nodes) restarting, rosmaster can leave sockets in a CLOSE_WAIT state, eventually exhausting the limit on open file descriptors and becoming unresponsive.
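For readers less familiar with the state itself: a socket sits in CLOSE_WAIT when the remote end has closed the connection but the local process has not yet called close() on its own end. The snippet below is only a minimal illustration of that mechanism over loopback, not a reproduction of the rosmaster leak; the helper name `half_closed_pair` is made up for this example.

```python
import socket


def half_closed_pair():
    """Return a client socket whose peer has already closed the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    peer.close()      # the "node" goes away; its FIN reaches the client
    listener.close()
    return client


if __name__ == "__main__":
    client = half_closed_pair()
    client.settimeout(5.0)
    # An empty read signals EOF from the peer. Until client.close() is
    # called, the kernel keeps this socket (and its descriptor) in CLOSE_WAIT.
    print(repr(client.recv(1)))
    client.close()    # only now is the descriptor released
```

A cached connection that is never closed after its peer exits stays in this state indefinitely, which would be consistent with the descriptor creep described here.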
Our system has about 30 nodes and 150 topics. When it runs without restarts, system-wide monitoring shows the number of CLOSE_WAIT sockets creeping steadily upwards. Closer examination attributes most of these to the rosmaster process. See picture.
The sudden jump in the image has not been fully explained but it's likely associated with a hardware problem that caused parts of the system to restart repeatedly. Still, the trend is obvious.
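For anyone who wants to track the same system-wide number without a monitoring stack, here is a rough sketch that counts CLOSE_WAIT entries in text formatted like Linux's /proc/net/tcp, where the fourth column holds the TCP state as a hex code and 08 means CLOSE_WAIT. The function name is made up for this example, and it does not attribute sockets to a process; for that we relied on lsof.

```python
CLOSE_WAIT = "08"  # hex state code used in /proc/net/tcp on Linux


def count_close_wait(proc_net_tcp_text):
    """Count CLOSE_WAIT rows in text formatted like /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        # The header row has "st" in this column, so it never matches.
        if len(fields) >= 4 and fields[3] == CLOSE_WAIT:
            count += 1
    return count


if __name__ == "__main__":
    # Linux only; IPv6 sockets are listed separately in /proc/net/tcp6.
    with open("/proc/net/tcp") as f:
        print(count_close_wait(f.read()))
```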
Since diagnosing the problem, we've been trying to reproduce it with a simpler setup and to collect logs, but the results are inconclusive. Comparing the rosmaster logs with lsof output does suggest that the leaks are related to ServerProxy objects.
It does not seem to matter whether the node uses roscpp or rospy, or whether it is publisher-only or subscriber-only. There are even CLOSE_WAIT sockets associated with nodes that do not publish or subscribe to anything (they just read some parameters on startup).
We've run a belt-and-suspenders approach since hitting the limit, which involves a) nightly restarts and b) keeping the ServerProxy cache trimmed (patch below).
The nightly restarts have kept us from hitting the limit, but we're planning to run without them for a while to see if the ServerProxy patch helps. The default daily cycle already looks different:
The spikes are due to rosbag restarting every hour, which keeps the bags at a manageable size.
When there is a leak, it seems to be contained:
(These are system-wide numbers; rosmaster itself has 90 CLOSE_WAIT sockets.)
--- util.py.old 2015-04-22 12:23:11.054085143 +0300
+++ util.py.new 2015-04-22 12:23:18.418073585 +0300
@@ -45,7 +45,12 @@
 except ImportError:
     from xmlrpclib import ServerProxy
 
-_proxies = {} #cache ServerProxys
+import collections
+import threading
+_proxies = collections.OrderedDict()
+_lock = threading.Lock()
+N = 100
+
 def xmlrpcapi(uri):
     """
     @return: instance for calling remote server or None if not a valid URI
@@ -56,11 +61,23 @@
     uriValidate = urlparse(uri)
     if not uriValidate[0] or not uriValidate[1]:
         return None
-    if not uri in _proxies:
-        _proxies[uri] = ServerProxy(uri)
+    # Experimental bit
+    with _lock:
+        proxy = _proxies.get(uri, None)
+        if proxy is None:
+            proxy = ServerProxy(uri)
+        else:
+            # OrderedDict requires a deletion to update order
+            del _proxies[uri]
+        _proxies[uri] = proxy
+        # Trim to size
+        while len(_proxies) > N:
+            _proxies.popitem(last=False)
+
     return _proxies[uri]
 
 
 def remove_server_proxy(uri):
-    if uri in _proxies:
-        del _proxies[uri]
+    with _lock:
+        if uri in _proxies:
+            del _proxies[uri]
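For discussion, here is the same idea as a standalone sketch rather than a patch against util.py. It is untested against rosmaster and makes several assumptions: it targets Python 3 (`xmlrpc.client`, `OrderedDict.move_to_end`), the `ProxyCache` class and its method names are invented for this example, and it explicitly closes the transport of each evicted proxy through the documented `proxy("close")()` call instead of waiting for garbage collection. It also returns the local `proxy` reference, whereas the patch's `return _proxies[uri]` appears to run after the lock is released and could raise KeyError if another thread removes or evicts the entry in between.

```python
import collections
import threading
from xmlrpc.client import ServerProxy


class ProxyCache:
    """Bounded, thread-safe LRU cache of ServerProxy objects (hypothetical)."""

    def __init__(self, max_size=100):
        self._max_size = max_size
        self._proxies = collections.OrderedDict()
        self._lock = threading.Lock()

    def get(self, uri):
        evicted = []
        with self._lock:
            proxy = self._proxies.get(uri)
            if proxy is None:
                # Constructing a ServerProxy does not open a connection yet.
                proxy = ServerProxy(uri)
                self._proxies[uri] = proxy
            else:
                # Mark as most recently used.
                self._proxies.move_to_end(uri)
            # Trim least recently used entries down to the size limit.
            while len(self._proxies) > self._max_size:
                _, old = self._proxies.popitem(last=False)
                evicted.append(old)
        # Close evicted transports outside the lock. Caveat: a thread that
        # is in the middle of a call on an evicted proxy may see it fail.
        for old in evicted:
            old("close")()
        return proxy

    def remove(self, uri):
        with self._lock:
            proxy = self._proxies.pop(uri, None)
        if proxy is not None:
            proxy("close")()

    def __contains__(self, uri):
        with self._lock:
            return uri in self._proxies

    def __len__(self):
        with self._lock:
            return len(self._proxies)
```

Whether closing on eviction is actually needed, or whether dropping the last reference is enough to release the socket, is something we have not verified.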


