
mgr: fix PyObject* refcounting in TTLCache and cleanup logic#65245

Merged
NitzanMordhai merged 2 commits into ceph:main from
NitzanMordhai:wip-nitzan-mgr-api-refcount-memory-leak-fixes
Dec 2, 2025

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Aug 26, 2025

Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and erase / clear did not correctly decref the values, leading
to use-after-free or leaks depending on usage.

Changes:

  • Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
  • Add TTLCache::clear() method for proper memory cleanup
  • Ensure TTLCache::get() returns a new reference
  • Fix misuse of std::move on c_str() in PyJSONFormatter

These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai nmordech@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

Member

@epuertat epuertat left a comment


Looks mostly good, Nitzan! Just left a few comments and questions. Hope they're useful.

import threading
import time
import enum
import gc
Member


Instead of bringing Python's explicit gc into play (which can be a risky guest), we could use Python's weakref to avoid increasing the ref counting on the Python side. For example, that healthcheck data structure could be a weakref.WeakValueDictionary object.
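A minimal sketch of that idea, assuming the health check events are instances of a class (the `HealthCheckEvent` class here is a hypothetical stand-in; note that values stored in a `weakref.WeakValueDictionary` must be weak-referenceable, which plain dicts and ints are not):

```python
import weakref

# Hypothetical stand-in for the healthcheck event type.
class HealthCheckEvent:
    def __init__(self, name):
        self.name = name

cache = weakref.WeakValueDictionary()
event = HealthCheckEvent("OSD_DOWN")
cache["OSD_DOWN"] = event

assert "OSD_DOWN" in cache       # alive while a strong reference exists
del event                        # drop the only strong reference
assert "OSD_DOWN" not in cache   # CPython refcounting evicts the entry immediately
```

The cache then never keeps an object alive on its own, so no explicit gc call is needed to bound memory.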

I thought that the functools cache helpers already used weak refs, but I was wrong. It's the non-standard cachetools library that uses them (for key values, as part of function memoization).

Contributor Author


I added an explicit gc.collect() to maintain a low memory footprint in long-lived mgr sessions. That said, I’m open to exploring weakref-based caches or LRU strategies in a follow-up.

Member


Invoking gc.collect() in Python triggers a complete scan of all tracked in-memory data, including deeply nested dicts and large collections. If we're not keeping complex data structures with circular references (where weakrefs are usually the best approach), there's no benefit in invoking gc.collect(): ref-counting is enough to deallocate resources. Is this based on real benchmarking?
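This point can be illustrated directly: without reference cycles, CPython's refcounting frees objects as soon as the last reference goes away and the cyclic collector finds nothing to do; only actual cycles need gc.collect(). The `Node` class below is purely illustrative:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

gc.collect()                     # start from a clean state

# No cycle: refcounting frees the instance as soon as the name is dropped.
n = Node()
del n
no_cycle_collected = gc.collect()

# Cycle: refcounts never reach zero, so only the cyclic collector reclaims it.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b
cycle_collected = gc.collect()

assert no_cycle_collected == 0   # plain refcounting already cleaned up
assert cycle_collected > 0       # the two-node cycle needed gc.collect()
```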

Contributor Author


I can remove the gc.collect() now, after changing the refcounts in the cache itself. According to the massif output, the mgr API no longer leaks that memory.

# Sort by last_seen and remove oldest entries
sorted_checks = sorted(self.healthcheck.values(), key=lambda x: x.last_seen)
for check in sorted_checks[:len(self.healthcheck) - self.max_entries]:
    self.healthcheck.pop(check.name, None)
Member


Why not use the functools LRU cache (with weakrefs)? With that we don't need a TTL (weakrefs will be voided on the mgr C++ side), and that cache explicitly handles the size limit.

Contributor Author


@epuertat sorry, I missed that one; can you please explain what you mean by that?

This is for historical health monitoring: the mgr C++ side gives us the current state, but we want to maintain a history of health checks over time. We need to prevent the history from growing beyond max_entries, and currently we also expire old entries after a set time.

I think we could get rid of the TTL and only maintain max_entries limit - is that what you mean? Or are you suggesting a different approach for the historical tracking itself?

Member


Yeah, I was suggesting that we simplify that struct. What about a collections.deque? It's basically a fixed-size FIFO queue. I assume the events are inserted in order (maybe I'm wrong here), so if that's true, then we don't need an expiration time, just regular FIFO eviction.
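A sketch of the deque idea, assuming ordered insertion (the event names are illustrative):

```python
from collections import deque

# Fixed-size FIFO: with maxlen set, appending to a full deque silently
# evicts the oldest entry from the opposite end.
history = deque(maxlen=3)
for event in ["OSD_DOWN", "PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_NEARFULL"]:
    history.append(event)

# "OSD_DOWN" was evicted when the fourth event arrived.
assert list(history) == ["PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_NEARFULL"]
```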

Contributor Author


Got it!

TTLCache(uint16_t ttl_ = 0, uint16_t size = UINT16_MAX, float spread = 0.25)
: TTLCacheBase<Key, PyObject*>(ttl_, size, spread) {}
~TTLCache(){};
~TTLCache(){ clear(); };
Member


Should we implement this method for the base Cache instead?

Contributor Author


@epuertat I tried to move it to the base cache. The problem is that Cache is templated on <Key, Value>, not <Key, PyObject*>; that means we would either need to duplicate it for <Key, PyObject*> or add a check for Value == PyObject*, and that starts to be more complicated than implementing it in TTLCache<Key, PyObject*>.

Member


Oh, I thought that the base class was templated to support arbitrary key/value types.

@NitzanMordhai NitzanMordhai changed the title mgr/TTLCache: fix PyObject* lifetime management and cleanup logic mgr: fix PyObject* refcounting in TTLCache and cleanup logic Aug 28, 2025
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 0757719 to 52b83eb on August 29, 2025 08:17
@NitzanMordhai
Contributor Author

@epuertat Thanks a lot for reviewing! I fixed and repushed, please take a look.

@NitzanMordhai NitzanMordhai requested a review from epuertat August 29, 2025 08:19
These limits are configurable via the following runtime options:

``mgr/prometheus/healthcheck_history_max_entries`` - the maximum number of health check events to retain in memory (default: 1000).
``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
Contributor


Suggested change
``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
``mgr/prometheus/healthcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 52b83eb to 55fbc4d on August 29, 2025 09:21
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch 4 times, most recently from 7daac31 to 679b2fb on September 1, 2025 12:27
Member

@epuertat epuertat left a comment


Everything else looks good!

Comment on lines +206 to 226
with self.lock:
changes_made = False
names = set(self.healthcheck) | set(current_checks)

for name in names:
present = name in current_checks
check = self.healthcheck.get(name)
if check is None:
if present:
info = current_checks[name]
self.healthcheck[name] = HealthCheckEvent(
name=name,
severity=info.get('severity'),
first_seen=now,
last_seen=now,
count=1,
active=True
)
changes_made = True

continue
Member


Does this work?? Sorry for the misleading suggestion: I just realized that the self.healthcheck cache is used for random access (by check name), so the deque won't work, since it's index-based (does it?? 🤯 ).

We would need a mixture of map/dict with a queue, something like:

from collections import OrderedDict

class LRUCacheDict(OrderedDict):
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key in self:
            del self[key]  # refresh position
        elif len(self) >= self.maxsize:
            self.popitem(last=False)  # drop oldest
        super().__setitem__(key, value)
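A quick self-contained check of the eviction behavior (the class is repeated here so the snippet runs standalone; the key names are illustrative):

```python
from collections import OrderedDict

class LRUCacheDict(OrderedDict):
    """Dict with random access by key plus eviction of the oldest entry at maxsize."""
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key in self:
            del self[key]              # refresh position
        elif len(self) >= self.maxsize:
            self.popitem(last=False)   # drop oldest
        super().__setitem__(key, value)

checks = LRUCacheDict(maxsize=2)
checks["OSD_DOWN"] = 1
checks["PG_DEGRADED"] = 2
checks["OSD_DOWN"] = 3    # re-insert refreshes its position to newest
checks["MON_DOWN"] = 4    # full, so the oldest entry (PG_DEGRADED) is evicted
assert list(checks) == ["OSD_DOWN", "MON_DOWN"]
```

So random access by check name still works, while the size bound is enforced on insert.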

Contributor Author


@epuertat ok, I'll redo the LRUCacheDict!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch 2 times, most recently from bad51b3 to 68a6bc8 on September 1, 2025 13:54
@SrinivasaBharath
Contributor

@NitzanMordhai - Please resolve check issues, and feel free to proceed with merging the PR.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 0362541 to e2bb649 on December 1, 2025 13:12
@NitzanMordhai
Contributor Author

jenkins test make check arm64

@NitzanMordhai
Contributor Author

jenkins test make check

This patch introduces several improvements to the Prometheus module:

 - Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
  Limits the in-memory healthcheck dict to a configurable max_entries (default 1000).
  TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).

 - Refactors HealthHistory.check() to use a unified iteration over known and current checks,
  improving concurrency and minimizing redundant updates.

 - Use cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
  HTTP compression with proper header handling and client negotiation.

 - Introduces new module options:
    - `healthcheck_history_max_entries`

 - Add proper error handling for CherryPy engine startup failures
 - Remove os._exit monkey patch in favor of proper exception handling
 - Remove manual Content-Type header setting (CherryPy handles automatically)

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and `erase` / `clear` did not correctly decref the values, leading
to use-after-free or leaks depending on usage.

Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter

These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from e2bb649 to 7fadf6a on December 2, 2025 08:52
@NitzanMordhai NitzanMordhai merged commit 8df7e65 into ceph:main Dec 2, 2025
13 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch December 2, 2025 13:31
@ljflores
Member

ljflores commented Dec 8, 2025

Hey @NitzanMordhai can you TAL at these two new issues?

https://tracker.ceph.com/issues/74148
https://tracker.ceph.com/issues/74149

Somehow they were missed in the QA review. For sure the second issue is related, and I'm pretty sure the first is as well.

NitzanMordhai added a commit to NitzanMordhai/ceph that referenced this pull request Dec 9, 2025
The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR ceph#65245 added the locks.

Fixes: https://tracker.ceph.com/issues/74148
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
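The deadlock described in that commit and its fix can be sketched as follows (a minimal stand-in for the real HealthHistory, not the actual mgr code):

```python
import threading

class HealthHistory:
    """Minimal sketch of the re-entrant locking pattern."""
    def __init__(self):
        # RLock allows the same thread to acquire the lock recursively;
        # with a plain Lock(), check() would block forever when it calls save().
        self.lock = threading.RLock()
        self.saved = 0

    def save(self):
        with self.lock:       # nested acquisition by the same thread
            self.saved += 1

    def check(self):
        with self.lock:
            self.save()       # re-enters the lock; fine with RLock

h = HealthHistory()
h.check()
assert h.saved == 1
```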
NitzanMordhai added a commit to NitzanMordhai/ceph that referenced this pull request Dec 11, 2025
PR ceph#65245 dropped the header set for the standby module;
we should still have it.

Fixes: https://tracker.ceph.com/issues/74149
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
