
Port from cline/cline#10343: periodic gateway memory logging #17667

Closed
teknium1 wants to merge 1 commit into main from cline-port/gateway-memory-monitor


@teknium1

Contributor

Summary

Gateway logs [MEMORY] rss=...MB gc=... threads=... uptime=...s to agent.log / gateway.log every 5 minutes so slow leaks in the long-lived process show up as a time series.

Ported from cline/cline#10343 (src/standalone/memory-monitor.ts). Their cline-core Node process and our gateway are the same shape of problem — a long-running autonomous-agent backend where a leak in any of caching / sessions / MCP / memory provider is invisible until you watch RSS climb for hours.

Changes

  • gateway/memory_monitor.py (new): daemon thread that logs a baseline on start, periodic snapshots at interval_seconds, and a final [MEMORY] shutdown ... line on stop. Uses resource.getrusage() (stdlib, Linux/macOS) first, falls back to psutil (already an optional dep via mcp_tool.py), disables itself with one WARNING if neither works.
  • gateway/run.py (~12010, ~12200): start right after setup_logging(), stop next to shutdown_mcp_servers(). Gated on logging.memory_monitor.enabled (default true) and wrapped in best-effort try/except so a monitor failure can never break gateway startup.
  • hermes_cli/config.py: new logging.memory_monitor block — enabled: true, interval_seconds: 300.
  • tests/gateway/test_memory_monitor.py: 10 targeted unit tests.
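The RSS fallback chain and the log-line format described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/memory_monitor.py: the function names `read_rss_mb` and `format_snapshot` are hypothetical, and reading `ru_maxrss` from `resource.getrusage()` is an assumption about which field the module uses.

```python
import gc
import sys
import threading
import time
from typing import Optional


def read_rss_mb() -> Optional[float]:
    """Return resident memory in MB, or None if no backend is usable."""
    try:
        import resource  # stdlib, Unix only (missing on Windows)

        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and kilobytes on Linux.
        divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
        return peak / divisor
    except (ImportError, OSError, ValueError):
        pass
    try:
        import psutil  # optional dependency

        return psutil.Process().memory_info().rss / (1024 * 1024)
    except Exception:
        # Neither backend works: the caller should warn once and disable itself.
        return None


def format_snapshot(rss_mb: float, started_at: float, label: str = "") -> str:
    """Build one '[MEMORY] ...' line; `label` is e.g. 'baseline' or 'shutdown'."""
    prefix = f"[MEMORY] {label} " if label else "[MEMORY] "
    uptime = time.monotonic() - started_at
    return (
        f"{prefix}rss={rss_mb:.0f}MB gc={gc.get_count()} "
        f"threads={threading.active_count()} uptime={uptime:.0f}s"
    )
```

One caveat with this particular sketch: `ru_maxrss` is the peak resident set size, so a series built on it can plateau but never fall, whereas the psutil path reports current RSS. For spotting slow leaks either is usable, but the two are not directly comparable.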

Adaptation notes (vs. the upstream TS port)

  • Node setInterval + .unref() → Python threading.Thread(daemon=True) driven by a threading.Event.wait() so shutdown is immediate instead of waiting for the next tick.
  • Log line includes gc=(gen0,gen1,gen2) and threads=N instead of V8's external/arrayBuffers — more useful for Python leaks (thread leaks + GC pressure are the common gateway failure modes).
  • No Node --heapsnapshot-near-heap-limit equivalent. CPython's closest analogue is tracemalloc, which has non-trivial steady-state overhead; deferring that to a separate PR if someone asks for it.
  • Config-gated: logging.memory_monitor.enabled: false silences the line entirely, matching other diagnostic toggles under logging:.
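The thread and config-gating notes above can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: `PeriodicLogger`, `start_if_enabled`, and the logger name are hypothetical, and `config` is assumed to be a plain nested dict mirroring the logging.memory_monitor block.

```python
import logging
import threading

logger = logging.getLogger("memory_monitor_sketch")  # hypothetical logger name


class PeriodicLogger:
    """Hypothetical helper: logs `snapshot()` every `interval_seconds`."""

    def __init__(self, snapshot, interval_seconds=300.0):
        self._snapshot = snapshot
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        if self._thread is not None:  # double start is a no-op
            return
        self._stop.clear()
        logger.info("[MEMORY] baseline %s", self._snapshot())
        self._thread = threading.Thread(
            target=self._run, name="memory-monitor", daemon=True
        )
        self._thread.start()

    def _run(self):
        # Event.wait() returns True as soon as stop() sets the event, and
        # False when the timeout elapses, which is the cue for a snapshot.
        while not self._stop.wait(self._interval):
            logger.info("[MEMORY] %s", self._snapshot())

    def stop(self):
        if self._thread is None:
            return
        self._stop.set()  # wakes the waiting thread immediately
        self._thread.join(timeout=5)
        self._thread = None
        logger.info("[MEMORY] shutdown %s", self._snapshot())


def start_if_enabled(config, snapshot):
    """Best-effort start: return a running monitor, or None if disabled or failed."""
    try:
        settings = config.get("logging", {}).get("memory_monitor", {})
        if not settings.get("enabled", True):
            return None
        monitor = PeriodicLogger(snapshot, float(settings.get("interval_seconds", 300)))
        monitor.start()
        return monitor
    except Exception:
        # A monitor failure must never break startup of the host process.
        logger.warning("[MEMORY] monitor failed to start", exc_info=True)
        return None
```

Compared with a `time.sleep(interval)` loop, waiting on the event means `stop()` does not have to sit out the remainder of a 300s tick, and `daemon=True` plays the role of `.unref()` by not keeping the process alive on its own.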

Validation

$ bash scripts/run_tests.sh tests/gateway/test_memory_monitor.py -v
============================== 10 passed in 1.27s ==============================

Sample log output. The smoke run used a 0.3s interval, which rounds to 0s in the "started" line; that is cosmetic at sub-second intervals and not an issue at the 300s default:

[MEMORY] baseline rss=28MB gc=(549, 2, 3) threads=1 uptime=0s
[MEMORY] Periodic memory monitoring started (interval: 300s)
[MEMORY] rss=28MB gc=(590, 2, 3) threads=2 uptime=300s
[MEMORY] rss=29MB gc=(591, 2, 3) threads=2 uptime=600s
[MEMORY] shutdown rss=29MB gc=(594, 2, 3) threads=2 uptime=903s
[MEMORY] Periodic memory monitoring stopped

Grep-friendly: grep '\[MEMORY\] rss=' ~/.hermes/logs/gateway.log | awk '{print $1,$2,$4}' gives a quick "RSS over time" view.
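For anything beyond a quick look, the same periodic lines can be pulled into Python. The sketch below matches only the `[MEMORY] rss=` portion shown in the sample output and makes no assumption about the timestamp prefix the logger puts in front of it; `parse_memory_lines` is a hypothetical helper, not part of this PR.

```python
import re

# Matches periodic lines only: baseline and shutdown lines carry a label
# between "[MEMORY]" and "rss=", so they are skipped, as with the grep above.
MEMORY_LINE = re.compile(
    r"\[MEMORY\] rss=(?P<rss>\d+)MB gc=\((?P<gc>[\d, ]+)\) "
    r"threads=(?P<threads>\d+) uptime=(?P<uptime>\d+)s"
)


def parse_memory_lines(lines):
    """Yield (uptime_seconds, rss_mb, threads) for each periodic [MEMORY] line."""
    for line in lines:
        match = MEMORY_LINE.search(line)
        if match:
            yield (
                int(match.group("uptime")),
                int(match.group("rss")),
                int(match.group("threads")),
            )


sample = [
    "[MEMORY] baseline rss=28MB gc=(549, 2, 3) threads=1 uptime=0s",
    "[MEMORY] rss=28MB gc=(590, 2, 3) threads=2 uptime=300s",
    "[MEMORY] rss=29MB gc=(591, 2, 3) threads=2 uptime=600s",
    "[MEMORY] shutdown rss=29MB gc=(594, 2, 3) threads=2 uptime=903s",
]
series = list(parse_memory_lines(sample))
print(series)  # [(300, 28, 2), (600, 29, 2)]
```

Feeding it `open(path)` for a real log file works the same way, since the function only iterates over lines.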

Context

Hermes has a memory-leak-audit skill and the gateway is a known long-running process that caches agent instances, session transcripts, MCP connections, tool schemas, and memory providers. This adds the basic instrumentation a leak audit starts from — without it, every audit has to recommend the user add temporary ps logging first.

Emit a grep-friendly '[MEMORY] rss=...MB ...' line in agent.log /
gateway.log every N minutes (default 5) so slow leaks in the long-lived
gateway process show up as a time series. Based on
cline/cline#10343
(src/standalone/memory-monitor.ts).

- gateway/memory_monitor.py: new module. Daemon thread, baseline on
  start, final snapshot on stop. Uses resource.getrusage() (stdlib)
  first, falls back to psutil, disables itself with one WARNING if
  neither is available.
- gateway/run.py: start monitor right after setup_logging() in
  start_gateway(); stop it in the shutdown block next to MCP teardown.
- hermes_cli/config.py: logging.memory_monitor { enabled, interval_seconds }
  defaults under the existing logging section.
- tests/gateway/test_memory_monitor.py: 10 unit tests covering format,
  baseline/shutdown snapshots, double-start noop, periodic timer,
  daemon thread invariant, and unavailable-RSS warn-and-skip path.

Adapted from TypeScript/Node to Python (threading.Event-based daemon
thread instead of setInterval/unref), added Python-specific gc + thread
counts to the log line (handier than ext/arrayBuffers for diagnosing
Python gateway leaks), and gated behind a config.yaml toggle so users
can silence the periodic line if they want.

No heap-snapshot-on-OOM equivalent — CPython doesn't have V8's
--heapsnapshot-near-heap-limit; tracemalloc would be the Python
equivalent but adds non-trivial overhead, so leaving that out.
@alt-glitch added the type/feature, comp/gateway, area/config, and P3 labels on Apr 30, 2026
@teknium1

Contributor Author

Closing in favor of #27102, which salvages this PR onto current main. Clean cherry-pick — all new files. 10 new tests pass, 5506 gateway regression tests pass, E2E smoke run confirms baseline + periodic + shutdown [MEMORY] lines all emit with RSS, GC, threads, and uptime populated.



