fix(browser): SIGKILL Chrome descendants when reaping orphaned daemons #17547
alexzhu0 wants to merge 1 commit
Closes #17388.

_reap_orphaned_browser_sessions() previously sent SIGTERM only to the agent-browser daemon PID. The daemon spawns headless Chrome, which spawns its renderer / GPU / network / storage helper processes as its own children. When the daemon is reaped but those helpers ignore SIGTERM (or miss it via reparenting to init), they keep running at 100% CPU indefinitely; the reporter observed 7 Chrome processes surviving 3+ days after their hermes owner died.

Add a new _descendant_pids(pid) helper that walks the full process tree via psutil (cross-platform) with a Linux /proc fallback, and a _kill_browser_tree() that:

1. Collects descendants BEFORE signalling (so reparented helpers are captured: once the daemon exits, init becomes their parent and the walk would miss them).
2. SIGTERMs the daemon.
3. Sleeps 0.5s for graceful shutdown.
4. SIGKILLs any descendant still alive.

Failure to signal any individual PID is silently tolerated (ProcessLookupError / PermissionError / OSError). The reaper logs the count of force-killed stragglers so this bug becomes visible in journalctl going forward.
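The four-step sequence above can be sketched roughly as follows. This is not the code from the patch: kill_tree_sketch, _pid_alive, and the collect_descendants parameter are hypothetical names chosen for illustration, and the sketch assumes a POSIX system where os.kill and SIGKILL are available.

```python
import os
import signal
import time
from typing import Callable, Iterable

# Errors tolerated when signalling a single PID, as listed in the description.
_TOLERATED = (ProcessLookupError, PermissionError, OSError)


def _pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (nothing is delivered).

    Note: a zombie that has not been waited on still counts as alive here.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    except OSError:
        return False
    return True


def kill_tree_sketch(
    daemon_pid: int,
    collect_descendants: Callable[[int], Iterable[int]],
    grace_seconds: float = 0.5,
) -> int:
    """Terminate a daemon, then force-kill descendants that outlive it.

    Returns the number of descendants that were sent SIGKILL.
    """
    # 1. Snapshot the tree BEFORE signalling: once the daemon exits, its
    #    children are reparented and can no longer be found through it.
    descendants = set(collect_descendants(daemon_pid))

    # 2. Ask the daemon to shut down gracefully.
    try:
        os.kill(daemon_pid, signal.SIGTERM)
    except _TOLERATED:
        pass

    # 3. Give it a moment to clean up its own children.
    time.sleep(grace_seconds)

    # 4. Force-kill whatever is left from the snapshot.
    force_killed = 0
    for pid in descendants:
        if not _pid_alive(pid):
            continue
        try:
            os.kill(pid, signal.SIGKILL)
            force_killed += 1
        except _TOLERATED:
            pass
    return force_killed
```

Passing the descendant collector in as a parameter is a choice made for this sketch only, so the ordering logic can be read (and exercised) separately from the process-tree walk.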
Thanks for the cross-reference, @alt-glitch. Quick note for maintainer visibility on how #15008 and this PR relate — they're orthogonal, not a stack:
Both touch
Closing as part of post-mortem cleanup of an early-batch proactive audit that did not get review traction. The patch still applies if anyone wants to repurpose it. My contribution methodology has moved to alexzhu0/echo-agent; I am not pursuing this individual fix further.
Closes #17388.
Symptom
hermes-agent launches agent-browser (headless Chrome) for a task, the task completes, hermes dies/crashes/restarts, and days later Chrome's renderer / GPU / helper processes are still running at 100% CPU. The reporter observed 7 Chrome processes consuming ~580% CPU for 3+ days after the owning hermes process was gone.
Root cause
_reap_orphaned_browser_sessions() in tools/browser_tool.py detects dead-owner daemons correctly, but when it reaps them it only sends SIGTERM to the daemon PID itself.
The daemon spawned Chrome; Chrome spawned its renderer / GPU / network / storage helpers as its own children. When the daemon dies, those helpers either ignore SIGTERM or never receive it (they are reparented to init), so they keep running at 100% CPU indefinitely.
The reporter's table showed exactly that shape — GPU helper + 4 renderers + Network helper + Storage helper, each eating a whole CPU.
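The failure shape described above can be reproduced in miniature. The sketch below is a toy stand-in, not the hermes or Chrome code: a small Python parent plays the daemon, a sleep process plays a Chrome helper, and child_survives_parent_sigterm is a hypothetical name. It assumes a POSIX system with a sleep binary on PATH.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in "daemon": starts a long-lived child, reports its PID, then idles.
_PARENT_SRC = (
    "import subprocess, time\n"
    "child = subprocess.Popen(['sleep', '30'])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(30)\n"
)


def child_survives_parent_sigterm() -> bool:
    """SIGTERM only the parent PID and report whether its child is still alive."""
    parent = subprocess.Popen(
        [sys.executable, "-c", _PARENT_SRC],
        stdout=subprocess.PIPE,
        text=True,
    )
    child_pid = int(parent.stdout.readline())

    # Signal the parent PID only, as the pre-fix reaper did with the daemon.
    os.kill(parent.pid, signal.SIGTERM)
    parent.wait()
    parent.stdout.close()
    time.sleep(0.2)

    try:
        os.kill(child_pid, 0)  # signal 0 only probes for existence
        survived = True
    except ProcessLookupError:
        survived = False

    if survived:
        # Clean up the orphan so the demo does not leak a process.
        os.kill(child_pid, signal.SIGKILL)
    return survived
```

os.kill targets exactly one PID, so the child never sees the SIGTERM and is simply reparented when its parent exits.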
Fix
Add _descendant_pids(pid) and _kill_browser_tree(daemon_pid):
- _descendant_pids walks the full process tree using psutil (cross-platform); falls back to a Linux /proc/*/status PPid walk when psutil is absent; returns an empty set otherwise (the caller degrades gracefully to the old SIGTERM-only behavior).
- _kill_browser_tree:
  1. Collect descendants before signalling.
  2. SIGTERM the daemon.
  3. Sleep 0.5s for graceful shutdown.
  4. SIGKILL any descendant still alive.
- _reap_orphaned_browser_sessions now calls _kill_browser_tree instead of raw os.kill. It logs both the reap count and the force-kill straggler count so this class of bug becomes visible in journalctl -u hermes-gateway going forward.
Not in scope
- _reap_orphaned_browser_sessions path has no unit tests for similar reasons. Happy to add an integration test in a follow-up if preferred.
- AGENT_BROWSER_IDLE_TIMEOUT_MS self-termination (Endless Terminals Environment Integration #24), but that's the happy path; this PR covers the unhappy path where the owner crashed before idle fires.
File
tools/browser_tool.py: +101 / -1
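For readers without the diff at hand, the descendant walk described under Fix (psutil first, then a Linux /proc/*/status PPid walk, else an empty set) might look roughly like this. It is a sketch under those stated assumptions, not the patch itself; descendant_pids_sketch and _children_map_from_proc are hypothetical names.

```python
import os
from collections import defaultdict, deque
from typing import Dict, List, Set

try:
    import psutil  # optional dependency
except ImportError:
    psutil = None


def _children_map_from_proc() -> Dict[int, List[int]]:
    """Build a parent PID -> child PIDs map from the PPid field of /proc/*/status."""
    children: Dict[int, List[int]] = defaultdict(list)
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status", encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    if line.startswith("PPid:"):
                        children[int(line.split()[1])].append(int(entry))
                        break
        except (OSError, ValueError, IndexError):
            continue  # the process vanished mid-scan or is unreadable
    return children


def descendant_pids_sketch(pid: int) -> Set[int]:
    """Return every descendant PID of pid, or an empty set if none can be found."""
    if psutil is not None:
        try:
            return {p.pid for p in psutil.Process(pid).children(recursive=True)}
        except psutil.Error:
            return set()

    if os.path.isdir("/proc"):
        children = _children_map_from_proc()
        found: Set[int] = set()
        queue = deque([pid])
        while queue:  # breadth-first walk down the tree
            for child in children.get(queue.popleft(), ()):
                if child not in found:
                    found.add(child)
                    queue.append(child)
        return found

    # No psutil and no /proc: the caller falls back to SIGTERM-only behavior.
    return set()
```

Either path only sees the tree as it exists at call time, which is why the walk has to happen before the daemon is signalled.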