fix: use RssAnon instead of VmRSS for watchdog threshold #1004

Closed
garrytan-agents wants to merge 2 commits into garrytan:master from garrytan-agents:fix/rss-anon-watchdog
@garrytan-agents

Contributor

Problem

process.memoryUsage().rss returns VmRSS which includes file-backed mmap pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.

The kernel reclaims file-backed pages under memory pressure — they are cache, not real usage. But the watchdog sees 7GB and triggers gracefulShutdown() every single autopilot cycle.

Measurement

| Metric | During git sync | At idle |
| --- | --- | --- |
| process.memoryUsage().rss (VmRSS) | 7,023 MB | 90 MB |
| RssAnon (actual heap) | ~200 MB | 48 MB |
| RssFile (file cache) | ~6,800 MB | 42 MB |

Fix

Replace process.memoryUsage().rss with a getAccurateRss() helper that reads /proc/self/status for RssAnon + RssShmem — the anonymous pages that represent actual memory allocation.

Falls back to process.memoryUsage().rss on non-Linux (macOS, Windows).
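
A minimal sketch of what such a helper could look like, not the code from this PR. The name getAccurateRss comes from the description above; parseStatusFieldKb and anonRssFromStatus are hypothetical names introduced here for illustration. It assumes the usual /proc/self/status line shape (`RssAnon:   204800 kB`) and treats a missing RssAnon field (older kernels) the same as a non-Linux platform.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: pull one "Field:   <n> kB" value out of the text of
// /proc/<pid>/status. Returns the value in kB, or null if the field is absent.
function parseStatusFieldKb(status: string, field: string): number | null {
  const match = status.match(new RegExp(`^${field}:\\s+(\\d+)\\s+kB$`, "m"));
  return match ? Number(match[1]) : null;
}

// Hypothetical helper: RssAnon + RssShmem in bytes, or null when RssAnon is
// not reported (so the caller can fall back instead of reading it as 0).
export function anonRssFromStatus(status: string): number | null {
  const anonKb = parseStatusFieldKb(status, "RssAnon");
  if (anonKb === null) return null;
  // Assumption: a missing RssShmem line is treated as zero shared memory.
  const shmemKb = parseStatusFieldKb(status, "RssShmem") ?? 0;
  return (anonKb + shmemKb) * 1024;
}

// Anonymous resident memory on Linux; process.memoryUsage().rss elsewhere.
export function getAccurateRss(): number {
  if (process.platform === "linux") {
    try {
      const bytes = anonRssFromStatus(readFileSync("/proc/self/status", "utf8"));
      if (bytes !== null) return bytes;
    } catch {
      // /proc unreadable (for example, a restricted sandbox): use the fallback.
    }
  }
  return process.memoryUsage().rss;
}
```

Returning null for an absent field, rather than 0, keeps the watchdog from silently seeing "no memory in use" on systems that do not expose RssAnon.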

Before: Watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After: Watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related

garrytan#1002 (supervisor crash-count fix for the same symptom)

Wintermute added 2 commits May 14, 2026 23:13
The RSS watchdog triggers gracefulShutdown() which exits with code 0.
The supervisor was counting ALL exits < 5min as crashes, including
clean code=0 exits. After 10 watchdog-triggered restarts (typical with
a 96K-page brain where autopilot inflates RSS), the supervisor gave up
with max_crashes_exceeded.

Fix: code=0 exits reset crashCount to 0 and restart immediately with
no backoff. Only code≠0 exits count toward the crash limit.
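
The rule in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the supervisor's actual code: onChildExit, SupervisorState, and the backoff constants are hypothetical, the limit of 10 is taken from the message above, and the "exits < 5min" uptime window is left out for brevity.

```typescript
interface SupervisorState {
  crashCount: number;
}

type ExitDecision =
  | { action: "restart"; delayMs: number }
  | { action: "give_up"; reason: "max_crashes_exceeded" };

const MAX_CRASHES = 10; // limit mentioned in the commit message
const BASE_BACKOFF_MS = 1_000; // assumption: illustrative backoff base
const MAX_BACKOFF_MS = 60_000; // assumption: illustrative backoff cap

// Decide what to do after the child exits. A null exit code (child killed by
// a signal) is treated as a crash in this sketch.
function onChildExit(state: SupervisorState, exitCode: number | null): ExitDecision {
  if (exitCode === 0) {
    // Clean exit (for example, a watchdog-triggered gracefulShutdown):
    // reset the counter and restart immediately with no backoff.
    state.crashCount = 0;
    return { action: "restart", delayMs: 0 };
  }
  // Only non-zero exits count toward the crash limit.
  state.crashCount += 1;
  if (state.crashCount >= MAX_CRASHES) {
    return { action: "give_up", reason: "max_crashes_exceeded" };
  }
  const delayMs = Math.min(BASE_BACKOFF_MS * 2 ** (state.crashCount - 1), MAX_BACKOFF_MS);
  return { action: "restart", delayMs };
}
```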

Root cause: process.memoryUsage().rss reports 7GB during autopilot
sync on large repos (possibly shared page inflation from git mmap).
The 4096MB threshold triggers on every cycle. This is a separate
issue (RSS measurement accuracy) but the supervisor should handle
clean exits regardless.

process.memoryUsage().rss returns VmRSS which includes file-backed
mmap'd pages. On repos with large git packfiles (96K+ pages), git
operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.
The kernel reclaims these pages under memory pressure — they're cache.

Replace with /proc/self/status RssAnon + RssShmem which measures only
anonymous pages (heap, stack, anonymous mmap). This is the memory that
actually matters for OOM risk.

Falls back to process.memoryUsage().rss on non-Linux.

Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After:  watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related: garrytan#1002 (supervisor crash-count fix for the same symptom)
@garrytan

Owner

Superseded by #1003 fix wave.

The supervisor commit (d741574) from this PR duplicated #1003's existing commit. The RssAnon watchdog fix (b81c598) has been cherry-picked into #1003 (now landed as dab48cd on that branch).

During plan-eng-review on #1003, the fix wave grew to also:

  • preserve flap detection via lastExitCode tracking instead of resetting crashCount on every code=0 exit
  • add a clean-restart budget for the macOS / kernel-<4.5 fallback path
  • refactor autopilot.ts to share the supervisor core (DRY — Codex caught that autopilot had a parallel implementation with the same bug class)
  • fix a field-presence parser bug Codex surfaced in getAccurateRss
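
One possible reading of the first two bullets, sketched under loud assumptions: the real changes live in #1003 and are not shown here, so every name below (RestartState, recordExit, CLEAN_RESTART_BUDGET) and every constant is hypothetical. The idea illustrated is that a clean exit no longer erases crash history; lastExitCode is used only to skip backoff after a clean exit, and a separate budget caps consecutive clean restarts.

```typescript
interface RestartState {
  crashCount: number; // non-zero exits seen so far; NOT reset by clean exits
  cleanRestarts: number; // consecutive clean (code 0) exits
  lastExitCode: number | null;
}

type RestartDecision =
  | { action: "restart"; delayMs: number }
  | { action: "give_up"; reason: "max_crashes_exceeded" | "clean_restart_budget_exceeded" };

const CRASH_LIMIT = 10; // assumption: same limit as the original supervisor
const CLEAN_RESTART_BUDGET = 5; // assumption: illustrative value
const BACKOFF_MS = 1_000; // assumption: illustrative fixed backoff

function recordExit(state: RestartState, exitCode: number | null): RestartDecision {
  state.lastExitCode = exitCode;
  if (exitCode === 0) {
    // Clean exits draw from their own budget, so a process that keeps
    // shutting itself down cleanly cannot restart forever.
    state.cleanRestarts += 1;
    if (state.cleanRestarts > CLEAN_RESTART_BUDGET) {
      return { action: "give_up", reason: "clean_restart_budget_exceeded" };
    }
  } else {
    state.cleanRestarts = 0;
    state.crashCount += 1;
    if (state.crashCount >= CRASH_LIMIT) {
      return { action: "give_up", reason: "max_crashes_exceeded" };
    }
  }
  // Backoff depends on how the child last exited, not on a reset counter.
  return { action: "restart", delayMs: state.lastExitCode === 0 ? 0 : BACKOFF_MS };
}
```

With this shape, a child that alternates between crashing and exiting cleanly still reaches the crash limit, which a reset-on-every-clean-exit rule would never detect.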

Co-Authored-By trailers preserved on the cherry-pick. Thank you for the root-cause diagnosis — RssAnon + RssShmem is the right metric for per-process leak detection.
