fix: use RssAnon instead of VmRSS for watchdog threshold#1004
Closed
garrytan-agents wants to merge 2 commits into
added 2 commits
May 14, 2026 23:13
The RSS watchdog triggers gracefulShutdown(), which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts (typical with a 96K-page brain where autopilot inflates RSS), the supervisor gave up with max_crashes_exceeded.

Fix: code=0 exits reset crashCount to 0 and restart immediately with no backoff. Only code≠0 exits count toward the crash limit.

Root cause: process.memoryUsage().rss reports 7GB during autopilot sync on large repos (possibly shared page inflation from git mmap). The 4096MB threshold triggers on every cycle. This is a separate issue (RSS measurement accuracy), but the supervisor should handle clean exits regardless.
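The exit-handling rule described in this commit message can be sketched roughly as follows. This is an illustration, not the supervisor's actual code: onChildExit, SupervisorState, RestartDecision, the constant names, the exponential backoff, and the behavior for non-zero exits after the 5-minute window are all assumptions. Only the code=0 reset, the 5-minute window, the limit of 10, and the max_crashes_exceeded reason come from the message above.

```typescript
// Hypothetical types and names; the real supervisor may be structured differently.
interface SupervisorState {
  crashCount: number;
}

interface RestartDecision {
  action: "restart" | "give_up";
  delayMs: number;
  reason?: string;
}

const MAX_CRASHES = 10;             // limit implied by the commit message
const FAST_EXIT_MS = 5 * 60 * 1000; // exits sooner than this count as crashes
const BASE_BACKOFF_MS = 1000;       // assumed backoff base
const MAX_BACKOFF_MS = 60_000;      // assumed backoff cap

function onChildExit(
  state: SupervisorState,
  code: number | null, // null when the child was killed by a signal
  uptimeMs: number,
): RestartDecision {
  // Clean exit (for example a watchdog-triggered gracefulShutdown()):
  // reset the counter and restart immediately with no backoff.
  if (code === 0) {
    state.crashCount = 0;
    return { action: "restart", delayMs: 0 };
  }

  // Non-zero exit: count it toward the limit only if the child died quickly.
  // Assumption: a crash after a long healthy run starts the count over at 1.
  state.crashCount = uptimeMs < FAST_EXIT_MS ? state.crashCount + 1 : 1;

  if (state.crashCount >= MAX_CRASHES) {
    return { action: "give_up", delayMs: 0, reason: "max_crashes_exceeded" };
  }

  // Assumed exponential backoff: 1s, 2s, 4s, ... capped at 60s.
  const delayMs = Math.min(BASE_BACKOFF_MS * 2 ** (state.crashCount - 1), MAX_BACKOFF_MS);
  return { action: "restart", delayMs };
}
```

With this shape, any number of consecutive watchdog-triggered clean exits leaves crashCount at 0, so only genuine failures can reach the give-up branch.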
process.memoryUsage().rss returns VmRSS, which includes file-backed mmap'd pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB. The kernel reclaims these pages under memory pressure; they're cache.

Replace with /proc/self/status RssAnon + RssShmem, which measures only anonymous pages (heap, stack, anonymous mmap). This is the memory that actually matters for OOM risk. Falls back to process.memoryUsage().rss on non-Linux.

Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After: watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related: garrytan#1002 (supervisor crash-count fix for the same symptom)
5 tasks
Owner
Superseded by #1003 fix wave. The supervisor commit (d741574) from this PR duplicated #1003's existing commit. The RssAnon watchdog fix (b81c598) has been cherry-picked into #1003 (now landed as dab48cd on that branch). During plan-eng-review on #1003, the fix wave grew to also:
Co-Authored-By trailers preserved on the cherry-pick. Thank you for the root-cause diagnosis: RssAnon + RssShmem is the right metric for per-process leak detection.
Problem
process.memoryUsage().rss returns VmRSS, which includes file-backed mmap pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.

The kernel reclaims file-backed pages under memory pressure; they are cache, not real usage. But the watchdog sees 7GB and triggers gracefulShutdown() on every autopilot cycle.

Measurement
process.memoryUsage().rss (VmRSS) | RssAnon (actual heap) | RssFile (file cache)

Fix
Replace process.memoryUsage().rss with a getAccurateRss() helper that reads /proc/self/status for RssAnon + RssShmem, the anonymous and shared-memory pages that represent actual memory allocation.

Falls back to process.memoryUsage().rss on non-Linux (macOS, Windows).

Before: Watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After: Watchdog only triggers on real memory growth (~100MB << 4GB threshold)
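
A minimal sketch of what a getAccurateRss() helper along these lines might look like. Only the getAccurateRss name, the /proc/self/status source, the RssAnon + RssShmem sum, and the fallback come from the description above; parseStatusFieldBytes and rssFromStatus are hypothetical helper names, and the actual implementation in the commit may differ. The sketch also falls back when the fields are missing, on the assumption that older Linux kernels do not report them.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: extract one "<Field>:   <n> kB" line from the text of
// /proc/<pid>/status and convert it to bytes. Returns null if the field is absent.
function parseStatusFieldBytes(status: string, field: string): number | null {
  const match = status.match(new RegExp(`^${field}:\\s+(\\d+)\\s+kB$`, "m"));
  return match ? Number(match[1]) * 1024 : null;
}

// Hypothetical helper: RssAnon + RssShmem in bytes, or null if either is missing.
function rssFromStatus(status: string): number | null {
  const anon = parseStatusFieldBytes(status, "RssAnon");
  const shmem = parseStatusFieldBytes(status, "RssShmem");
  if (anon === null || shmem === null) return null;
  return anon + shmem;
}

// Resident memory in bytes, excluding file-backed pages where possible.
function getAccurateRss(): number {
  if (process.platform === "linux") {
    try {
      const rss = rssFromStatus(readFileSync("/proc/self/status", "utf8"));
      if (rss !== null) return rss;
    } catch {
      // /proc unreadable: fall through to the portable value below.
    }
  }
  // Non-Linux (macOS, Windows) or missing fields: same value as before the fix.
  return process.memoryUsage().rss;
}
```

The watchdog would then compare getAccurateRss() against its threshold in place of process.memoryUsage().rss. Keeping the parsing in a pure function makes it testable against a captured status file without needing a Linux host.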
Related