mgr: fix PyObject* refcounting in TTLCache and cleanup logic #65245
Force-pushed from 24d05be to d699fca
epuertat left a comment
Looks mostly good, Nitzan! Just left a few comments and questions. Hope they're useful.
src/pybind/mgr/prometheus/module.py (Outdated)
    import threading
    import time
    import enum
    import gc
Instead of bringing Python's explicit gc into play (which can be a risky guest), we could use Python's weakref to avoid increasing the ref counting on the Python side. For example, that healthcheck data structure could be a weakref.WeakValueDictionary object.
I thought that the functools cache helpers already used weak refs, but I was wrong. It's the non-standard cachetools library that uses them (for key values, since it uses them for function memoization).
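To illustrate the reviewer's suggestion: with a `weakref.WeakValueDictionary`, cache entries vanish as soon as the last strong reference to the value is dropped, so the cache itself never pins memory. A minimal sketch (the `HealthCheckEvent` stand-in and key names are illustrative, not the module's actual types):

```python
import weakref

class HealthCheckEvent:
    """Minimal stand-in for the healthcheck value type (illustrative only)."""
    def __init__(self, name):
        self.name = name

# Values are held only weakly: an entry disappears once the last
# strong reference to its value is dropped elsewhere.
cache = weakref.WeakValueDictionary()
event = HealthCheckEvent("OSD_DOWN")
cache["OSD_DOWN"] = event

assert "OSD_DOWN" in cache      # visible while a strong reference exists
del event                       # drop the only strong reference
assert "OSD_DOWN" not in cache  # removed automatically by CPython refcounting
```

The catch is that values must keep a strong reference alive somewhere else for as long as they should stay cached; otherwise they evaporate immediately.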
I added an explicit gc.collect() to maintain a low memory footprint in long-lived mgr sessions. That said, I'm open to exploring weakref-based caches or LRU strategies in a follow-up.
Invoking a gc.collect() in Python triggers a complete scan of all in-memory data, including deeply nested dicts/large collections. If we're not keeping complex data structures with circular references (where weakrefs are usually the best approach), there's no benefit in invoking gc.collect(). Ref-counting is enough to deallocate resources. Is this based on a real benchmarking?
I can remove the gc.collect() now, after changing the refcounts in the Cache itself. According to the massif output, the mgr API no longer leaks that memory.
src/pybind/mgr/prometheus/module.py (Outdated)
    # Sort by last_seen and remove oldest entries
    sorted_checks = sorted(self.healthcheck.values(), key=lambda x: x.last_seen)
    for check in sorted_checks[:len(self.healthcheck) - self.max_entries]:
        self.healthcheck.pop(check.name, None)
Why not using functools LRU cache (with weakrefs)? With that we don't need TTL (weakrefs will be voided on the mgr C++ side), and that cache explicitly tackles the size count.
@epuertat Sorry, I missed that one. Can you please explain what you mean by that?
This is for historical health monitoring - the mgr C++ side gives us the current state, but we want to maintain history of health checks over time. We need to prevent the history from growing beyond max_entries, and currently we also expire old entries after x time.
I think we could get rid of the TTL and only maintain max_entries limit - is that what you mean? Or are you suggesting a different approach for the historical tracking itself?
Yeah, I was suggesting that we simplify that struct. What about a collections.deque? It's basically a fixed-size FIFO queue. I assume that the events are inserted in order (maybe I'm wrong here), so if that's true, then we don't need an expiration time, just the regular FIFO eviction.
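As a sketch of this suggestion, assuming events really do arrive in order (the check names below are illustrative): `deque(maxlen=N)` gives bounded FIFO eviction with no TTL bookkeeping at all.

```python
from collections import deque

# deque(maxlen=N) is a fixed-size FIFO: appending to a full deque
# silently evicts the oldest entry, so no TTL bookkeeping is needed.
history = deque(maxlen=3)
for name in ["OSD_DOWN", "PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_FULL"]:
    history.append(name)

assert list(history) == ["PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_FULL"]
assert "OSD_DOWN" not in history  # oldest entry was evicted
```

One caveat worth noting: a deque offers no key-based lookup, so if the cache needs random access by check name, a dict-like structure is still required.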
    TTLCache(uint16_t ttl_ = 0, uint16_t size = UINT16_MAX, float spread = 0.25)
        : TTLCacheBase<Key, PyObject*>(ttl_, size, spread) {}
    - ~TTLCache(){};
    + ~TTLCache(){ clear(); };
Should we implement this method for the base Cache instead?
@epuertat I tried to move it to the base Cache. The problem is that Cache is implemented with <key, value>, not <key, PyObject*>, which means we'd need to either duplicate it for <key, PyObject*> or add a check for value == PyObject*, and that starts to be more complicated than implementing it in TTLCache<Key, PyObject*>.
Oh, I thought that the base class was templated to support arbitrary key/value types.
Force-pushed from 0757719 to 52b83eb
@epuertat Thanks a lot for reviewing! I fixed and repushed; please take a look.
doc/mgr/prometheus.rst (Outdated)
    These limits are configurable via the following runtime options:

    ``mgr/prometheus/healthcheck_history_max_entries`` - the maximum number of health check events to retain in memory (default: 1000).
    ``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
Suggested change:

    - ``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
    + ``mgr/prometheus/healthcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
Force-pushed from 52b83eb to 55fbc4d
Force-pushed from 7daac31 to 679b2fb
    with self.lock:
        changes_made = False
        names = set(self.healthcheck) | set(current_checks)

        for name in names:
            present = name in current_checks
            check = self.healthcheck.get(name)
            if check is None:
                if present:
                    info = current_checks[name]
                    self.healthcheck[name] = HealthCheckEvent(
                        name=name,
                        severity=info.get('severity'),
                        first_seen=now,
                        last_seen=now,
                        count=1,
                        active=True
                    )
                    changes_made = True

                continue
Does this work?? Sorry for the misleading suggestion: I just realized that the self.healthcheck cache is used for random access (by check name), so the deque won't work, since it's index-based (does it?? 🤯).
We would need a mixture of map/dict with a queue, something like:

    from collections import OrderedDict

    class LRUCacheDict(OrderedDict):
        def __init__(self, maxsize, *args, **kwargs):
            self.maxsize = maxsize
            super().__init__(*args, **kwargs)

        def __setitem__(self, key, value):
            if key in self:
                del self[key]  # refresh position
            elif len(self) >= self.maxsize:
                self.popitem(last=False)  # drop oldest
            super().__setitem__(key, value)
@epuertat ok, i'll redo the LRUCacheDict!
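The reviewer's `LRUCacheDict` sketch, reassembled into self-contained runnable form with a small eviction demo (the check names and `maxsize` value are illustrative):

```python
from collections import OrderedDict

class LRUCacheDict(OrderedDict):
    """Dict with a size cap: inserting past maxsize evicts the oldest key."""
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key in self:
            del self[key]                 # refresh insertion position
        elif len(self) >= self.maxsize:
            self.popitem(last=False)      # drop the oldest entry
        super().__setitem__(key, value)

cache = LRUCacheDict(maxsize=2)
cache["OSD_DOWN"] = 1
cache["PG_DEGRADED"] = 2
cache["MON_DOWN"] = 3  # evicts "OSD_DOWN", the oldest key

assert list(cache) == ["PG_DEGRADED", "MON_DOWN"]
assert "OSD_DOWN" not in cache
```

Unlike a deque, this keeps O(1) random access by check name while still bounding the entry count, which is exactly the mixture of map and queue the comment asks for.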
Force-pushed from bad51b3 to 68a6bc8
@NitzanMordhai - Please resolve check issues, and feel free to proceed with merging the PR.
Force-pushed from 0362541 to e2bb649
jenkins test make check arm64

jenkins test make check
This patch introduces several improvements to the Prometheus module:
- Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
  Limits the in-memory healthcheck dict to a configurable max_entries (default 1000).
  TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).
- Refactors HealthHistory.check() to use a unified iteration over known and current checks,
  improving concurrency and minimizing redundant updates.
- Uses cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
  HTTP compression with proper header handling and client negotiation.
- Introduces a new module option: `healthcheck_history_max_entries`
- Adds proper error handling for CherryPy engine startup failures
- Removes the os._exit monkey patch in favor of proper exception handling
- Removes manual Content-Type header setting (CherryPy handles it automatically)

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Fix incorrect reference counting and memory retention behavior in TTLCache when storing PyObject* values. Previously, TTLCache::insert did not increment the reference count, and `erase`/`clear` did not correctly decref the values, leading to use-after-free or leaks depending on usage.

Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter

These changes prevent both memory leaks and use-after-free errors when mgr modules use cached Python objects.

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Force-pushed from e2bb649 to 7fadf6a
Hey @NitzanMordhai, can you take a look at these two new issues? https://tracker.ceph.com/issues/74148 Somehow they were missed in the QA review. For sure the second issue is related, and I'm pretty sure the first is as well.
The HealthHistory.check() method acquires the lock and then calls HealthHistory.save(), which also tries to acquire the same lock. With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock). Switch to RLock to allow nested acquisition by the same thread. PR ceph#65245 added the locks.

Fixes: https://tracker.ceph.com/issues/74148
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
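The deadlock pattern and its RLock fix can be sketched in a few lines (the class and method names mirror the commit message, but the bodies are illustrative, not the module's actual code):

```python
import threading

class HealthHistory:
    """Minimal sketch of the nested-lock pattern described above."""
    def __init__(self):
        # A plain threading.Lock() here would deadlock: check() holds the
        # lock while save() tries to acquire it again on the same thread.
        # RLock lets the owning thread re-acquire it recursively.
        self.lock = threading.RLock()
        self.healthcheck = {}

    def save(self):
        with self.lock:          # nested acquisition: fine with RLock
            return dict(self.healthcheck)

    def check(self, name):
        with self.lock:
            self.healthcheck[name] = True
            return self.save()   # re-enters the lock on the same thread

h = HealthHistory()
result = h.check("OSD_DOWN")
assert result == {"OSD_DOWN": True}
```

With `threading.Lock()` in place of `RLock()`, the `h.check(...)` call would hang forever on the inner `with self.lock:` in `save()`.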
PR ceph#65245 dropped the header set for the standby module; we should still have it.

Fixes: https://tracker.ceph.com/issues/74149
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>