Skip to content

fix(tools): escalate SIGTERM→SIGKILL on browser daemon + periodic orphan reap#15008

Open
tkwong wants to merge 1 commit into
NousResearch:mainfrom
tkwong:wip/browser-daemon-hard-kill
Open

fix(tools): escalate SIGTERM→SIGKILL on browser daemon + periodic orphan reap#15008
tkwong wants to merge 1 commit into
NousResearch:mainfrom
tkwong:wip/browser-daemon-hard-kill

Conversation

@tkwong

@tkwong tkwong commented Apr 24, 2026

Copy link
Copy Markdown

Summary

Builds on #11843 (owner_pid cross-process tracking). Adds:

  1. _terminate_browser_daemon(): SIGTERM with 2s grace, escalate to SIGKILL if the daemon ignores SIGTERM. Verifies exit before returning True; returns False on survival so the socket dir is left in place for the next reap attempt.
  2. _pid_exists(): helper that distinguishes alive / dead / foreign-owned via os.kill(pid, 0) + PermissionError.
  3. Periodic orphan reap every 5 min from the cleanup thread, not just once at startup + atexit. Catches daemons orphaned mid-session.
  4. Tunable via BROWSER_DAEMON_TERM_GRACE_SECONDS (default 2.0) and BROWSER_ORPHAN_REAP_INTERVAL_SECONDS (default 300).

Motivation

Agent-browser daemons with SIGTERM handlers that stall in cleanup survive reap. Without SIGKILL escalation they leak indefinitely until machine reboot — observed in a long-running gateway hitting EMFILE.

Composes with #11843 (owner_pid)

Orthogonal concerns — owner_pid decides whether to reap, this PR strengthens how the reap executes. Both live in _reap_orphaned_browser_sessions; conflict resolution kept both behaviors.

Test plan

  • pytest tests/tools/test_browser_cleanup.py tests/tools/test_browser_orphan_reaper.py tests/gateway/test_whatsapp_connect.py → 49/49 pass
  • Updated test_dead_owner_triggers_reap mock to model a SIGTERM-honoring daemon (old mock modeled zombie daemon that never dies, which is now the unhappy path under this PR — socket dir retention is expected)
  • Runtime smoke: gateway restarted on this branch, no EMFILE, no orphan accumulation across reap cycle

🤖 Generated with Claude Code

…han reap

Builds on NousResearch#11843 (owner_pid cross-process tracking). Adds:

1. _terminate_browser_daemon(): SIGTERM with 2s grace, escalate to SIGKILL
   if daemon ignores SIGTERM. Verifies exit before returning True; returns
   False on survival so socket dir is left for next reap attempt.
2. _pid_exists(): helper that distinguishes alive / dead / foreign-owned.
3. Periodic orphan reap every 5 min from cleanup thread (not just once
   at startup + atexit). Catches daemons orphaned mid-session.
4. Tunable via BROWSER_DAEMON_TERM_GRACE_SECONDS (default 2.0) and
   BROWSER_ORPHAN_REAP_INTERVAL_SECONDS (default 300).

Motivation: daemons with SIGTERM handlers that stall in cleanup survive
reap. Without escalation, they leak indefinitely until machine reboot.

Tests: 49/49 pass.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/browser Browser automation (CDP, Playwright) comp/tools Tool registry, model_tools, toolsets labels Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists tool/browser Browser automation (CDP, Playwright) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants