fix(browser): SIGKILL Chrome descendants when reaping orphaned daemons #17547

Closed
alexzhu0 wants to merge 1 commit into NousResearch:main from alexzhu0:fix/browser-orphan-chrome-cleanup

@alexzhu0

Contributor

Closes #17388.

Symptom

hermes-agent launches agent-browser (headless Chrome) for a task, the task completes, hermes dies/crashes/restarts — and days later, Chrome's renderer / GPU / helper processes are still running at 100% CPU. Reporter observed 7 Chrome processes consuming ~580% CPU for 3+ days after the owning hermes process was gone.

Root cause

_reap_orphaned_browser_sessions() in tools/browser_tool.py detects dead-owner daemons correctly, but when it reaps them it only sends SIGTERM to the daemon PID itself:

os.kill(daemon_pid, signal.SIGTERM)

The daemon spawned Chrome; Chrome spawned its renderer / GPU / network / storage helpers as its own children. When the daemon dies:

  • helpers that respect SIGTERM come down with it
  • helpers that ignore SIGTERM (renderers in certain states, stuck GPU helpers) get reparented to init and keep running forever

The reporter's table showed exactly that shape — GPU helper + 4 renderers + Network helper + Storage helper, each eating a whole CPU.

Fix

Add _descendant_pids(pid) and _kill_browser_tree(daemon_pid):

  • _descendant_pids walks the full process tree using psutil (cross-platform), falls back to a Linux /proc/*/status PPid walk when psutil is absent, and returns an empty set otherwise (the caller then degrades gracefully to the old SIGTERM-only behavior).
  • _kill_browser_tree:
    1. Collects descendants BEFORE signalling — once the daemon exits, init reparents helpers and the tree walk would miss them.
    2. SIGTERM daemon.
    3. Sleep 0.5s for graceful shutdown.
    4. SIGKILL any descendant still alive.
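
The /proc fallback for the descendant walk can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the names read_proc_ppids and descendants_from_ppid_map are hypothetical, and the sketch splits "read the parent map" from "walk the tree" only so the walk can be exercised without real processes.

```python
import os
from collections import defaultdict


def read_proc_ppids(proc_root="/proc"):
    """Build a {pid: ppid} map from <proc_root>/<pid>/status (Linux only)."""
    ppids = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return ppids  # no /proc available: the caller sees an empty map
    for name in entries:
        if not name.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            path = os.path.join(proc_root, name, "status")
            with open(path, errors="replace") as fh:
                for line in fh:
                    if line.startswith("PPid:"):
                        ppids[int(name)] = int(line.split()[1])
                        break
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the entry was unreadable
    return ppids


def descendants_from_ppid_map(root_pid, ppids):
    """Return every PID whose ancestor chain leads back to root_pid."""
    children = defaultdict(list)
    for pid, ppid in ppids.items():
        children[ppid].append(pid)
    found = set()
    stack = [root_pid]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child != root_pid and child not in found:
                found.add(child)
                stack.append(child)
    return found
```

When psutil is installed, psutil.Process(pid).children(recursive=True) yields the same set without reading /proc by hand.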

_reap_orphaned_browser_sessions now calls _kill_browser_tree instead of raw os.kill. Logs both the reap count and the force-kill straggler count so this class of bug becomes visible in journalctl -u hermes-gateway going forward.
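
The four steps can be sketched as below. This is a hedged approximation rather than the actual _kill_browser_tree: the function name kill_tree_sketch and the injectable kill and sleep parameters are assumptions added so the ordering can be checked without spawning Chrome. It is POSIX-only, since signal.SIGKILL does not exist on Windows.

```python
import os
import signal
import time


def kill_tree_sketch(daemon_pid, descendants, grace=0.5,
                     kill=os.kill, sleep=time.sleep):
    """SIGTERM the daemon, wait, then SIGKILL surviving descendants.

    `descendants` must have been collected by the caller BEFORE this call,
    while the helpers were still attached to the daemon's process tree.
    Returns the list of descendant PIDs that were force-killed.
    """
    def send(pid, sig):
        try:
            kill(pid, sig)
            return True
        except OSError:  # covers ProcessLookupError and PermissionError
            return False

    snapshot = sorted(descendants)
    send(daemon_pid, signal.SIGTERM)  # polite request to the daemon only
    sleep(grace)                      # window for graceful shutdown
    forced = []
    for pid in snapshot:
        # Signal 0 only probes for existence. A PID we may not signal is
        # treated as not ours to kill.
        if send(pid, 0) and send(pid, signal.SIGKILL):
            forced.append(pid)
    return forced
```

One limitation of any snapshot-based approach: a PID could in principle be recycled by an unrelated process during the grace period, so a production version might also compare process start times before sending SIGKILL.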

Not in scope

  • Doesn't add a regression test. Simulating a dead-owner + live-descendants scenario requires forking real processes in CI; the existing _reap_orphaned_browser_sessions path has no unit tests for similar reasons. Happy to add an integration test in a follow-up if preferred.
  • Doesn't touch the agent-browser daemon's own shutdown semantics. The daemon does have AGENT_BROWSER_IDLE_TIMEOUT_MS self-termination (Endless Terminals Environment Integration #24), but that's the happy path; this PR covers the unhappy path where the owner crashed before idle fires.

File

  • tools/browser_tool.py: +101 / -1

Closes #17388.

_reap_orphaned_browser_sessions() previously sent SIGTERM only to the
agent-browser daemon PID.  The daemon spawns headless Chrome which
spawns its renderer / GPU / network / storage helper processes as its
own children.  When the daemon is reaped but those helpers ignore
SIGTERM (or miss it via reparenting to init), they keep running at
100% CPU indefinitely — the reporter observed 7 Chrome processes
surviving 3+ days after their hermes owner died.

Add a new _descendant_pids(pid) helper that walks the full process
tree via psutil (cross-platform) with a Linux /proc fallback, and a
_kill_browser_tree() that:

1. Collects descendants BEFORE signalling (so helpers are captured
   before they are reparented: once the daemon exits, init becomes
   their parent and the walk would miss them).
2. SIGTERMs the daemon.
3. Sleeps 0.5s for graceful shutdown.
4. SIGKILLs any descendant still alive.

Failure to signal any individual PID is silently tolerated
(ProcessLookupError / PermissionError / OSError).  The reaper logs
the count of force-killed stragglers so this bug becomes visible in
journalctl going forward.
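
A small sketch of the tolerate-and-count behavior described above, under assumptions: try_signal and force_kill_stragglers are hypothetical names, and the log message wording is invented for illustration rather than taken from the patch.

```python
import logging
import os
import signal

logger = logging.getLogger(__name__)


def try_signal(pid, sig, kill=os.kill):
    """Send sig to pid, reporting failure as a status instead of raising."""
    try:
        kill(pid, sig)
        return "sent"
    except ProcessLookupError:
        return "gone"      # already exited, nothing left to do
    except PermissionError:
        return "denied"    # PID exists but belongs to another user
    except OSError:
        return "error"     # any other OS-level failure


def force_kill_stragglers(pids, kill=os.kill):
    """SIGKILL each PID, tolerate failures, and log how many were killed."""
    forced = sum(
        1 for pid in pids if try_signal(pid, signal.SIGKILL, kill) == "sent"
    )
    if forced:
        # Hypothetical message text; the count is what makes stragglers
        # visible in the service logs.
        logger.warning("force-killed %d orphaned browser descendant(s)", forced)
    return forced
```

The specific exceptions are caught before OSError because both are subclasses of it.
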
@alt-glitch added the type/bug (Something isn't working), tool/browser (Browser automation (CDP, Playwright)), and P2 (Medium: degraded but workaround exists) labels on Apr 29, 2026
@alt-glitch

Collaborator

Related to #15008 — same SIGTERM→SIGKILL escalation for browser daemon orphans. This PR extends #15008 with full process-tree kill via descendant walk.

@alexzhu0

Contributor Author

Thanks for the cross-reference, @alt-glitch. Quick note for maintainer visibility on how #15008 and this PR relate — they're orthogonal, not a stack:

  • #15008: Daemon process hangs in SIGTERM cleanup. Escalates to SIGKILL after grace.
  • #17547 (this): Daemon dies but Chrome helpers (renderer/GPU/network/storage) get reparented to init. Collects descendants before signalling, SIGTERMs tree, SIGKILLs survivors.

Both touch _reap_orphaned_browser_sessions, so whichever lands second will need a small rebase. Diff as-is applies cleanly to current upstream/main; no hard dependency on #15008. Happy to rebase onto #15008 if it lands first — the _kill_browser_tree helper drops in on top of _terminate_browser_daemon naturally (the descendant-walk step is additive to whatever kill strategy targets the daemon itself).

@alexzhu0

Contributor Author

Closing as part of post-mortem cleanup of an early-batch proactive audit that did not get review traction. The patch still applies if anyone wants to repurpose it. My contribution methodology has moved to alexzhu0/echo-agent — not pursuing this individual fix further.

Development

Successfully merging this pull request may close these issues.

agent-browser fails to clean up headless Chrome processes, causing zombie processes that consume 100% CPU
