
fix(gateway): recover from stale pid files and close cron agents#13979

Closed
bloodcarter wants to merge 2 commits into NousResearch:main from bloodcarter:fix/gateway-aux-client-fd-cleanup
@bloodcarter bloodcarter commented Apr 22, 2026

Bug Description

Two regressions were keeping the gateway from surviving long runs:

  1. Stale PID file blocks restart. After a crashy exit (anything that skipped atexit), the next gateway run --replace exits with:

    ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
    

    systemd's Restart=on-failure loops forever because the stale pid file is never unlinked.

  2. EMFILE after a couple of days. The original EMFILE fix closed ephemeral AIAgent instances in the gateway; this PR's original scope (#13979) cleaned the process-global auxiliary async-client cache on gateway teardown and per agent turn. Both miss the cron path: every 10-minute tick builds a fresh AIAgent plus async httpx clients in a new ThreadPoolExecutor worker loop, then abandons them when the pool shuts down. Because neuter_async_httpx_del disables __del__, the transports are garbage-collected with their sockets still open.

Root Cause

  1. gateway/status.py::_cleanup_invalid_pid_path delegates to remove_pid_file(), whose "only unlink if the pid is mine" safety check (meant for the --replace atexit handoff where the new process may already have overwritten the file) also runs on stale-record cleanup. At the callsite in get_running_pid, we've already verified the record points at a dead process — the safety check is inapplicable and prevents recovery.

  2. cron/scheduler.py::run_job never calls agent.close() and never reaps _client_cache. In a long-running gateway that ticks every N minutes, each run leaks subprocess handles (ProcessRegistry), the agent's OpenAI/httpx client, plus one set of async auxiliary httpx transports per cached-loop-miss.
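
For reference, "verified the record points at a dead process" can be pictured with the usual POSIX signal-0 probe. This is a minimal sketch, not the code in gateway/status.py; the helper name `pid_is_alive` is hypothetical.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 delivers nothing; the kernel only performs the existence
        # and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the pid record is stale
    except PermissionError:
        return True  # process exists but is owned by another user
    return True
```

Once a check like this has returned False for the recorded pid, the "is this pid mine?" guard adds nothing: the only process that could legitimately own the file is gone.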

Fix

Gateway PID recovery

  • _cleanup_invalid_pid_path unlinks unconditionally when cleanup_stale=True; the safety check stays on remove_pid_file for its actual purpose (atexit during handoff).
  • Regression test simulating a crashed-process record in tests/gateway/test_status.py::test_get_running_pid_cleans_stale_record_from_dead_process.
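
The interaction between the two helpers can be reproduced in isolation. The sketch below is an illustration under assumptions, not the real gateway/status.py: the function bodies are simplified, the signatures take an explicit path, and the pid `999999` stands in for a crashed process.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file(path: Path) -> None:
    """Create the pid file exclusively; raises FileExistsError if it exists."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))


def remove_pid_file(path: Path) -> None:
    """Handoff-safe removal: unlink only if the file records our own pid."""
    try:
        if int(path.read_text().strip()) == os.getpid():
            path.unlink()
    except (FileNotFoundError, ValueError):
        pass


def cleanup_invalid_pid_path(path: Path) -> None:
    """Stale-record cleanup: the caller has already proved the owner is dead."""
    path.unlink(missing_ok=True)


def demo() -> tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("999999")  # hypothetical pid of a crashed gateway

        remove_pid_file(pid_path)  # old cleanup path: refuses, pid is not ours
        try:
            write_pid_file(pid_path)
            blocked = False
        except FileExistsError:
            blocked = True  # what surfaced as "PID file race lost"

        cleanup_invalid_pid_path(pid_path)  # new cleanup path
        write_pid_file(pid_path)
        claimed = int(pid_path.read_text()) == os.getpid()
        return blocked, claimed
```

`demo()` returns `(True, True)`: the guarded removal leaves the stale file in place and the exclusive create fails, while the unconditional unlink lets the new process claim the file.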

Cron FD cleanup

  • Outer finally in run_job now:
    • calls agent.close() (ports the cleanup from fix: close ephemeral agents to prevent fd leaks #12998 into this PR — subprocesses, terminal sandbox, browser daemon, OpenAI client)
    • calls agent.auxiliary_client.cleanup_stale_async_clients() to drop cache entries whose worker-thread loop died with the ThreadPoolExecutor.
  • Regression tests in tests/cron/test_scheduler.py:
    • test_run_job_closes_agent_on_failure_to_prevent_fd_leak
    • test_run_job_reaps_stale_auxiliary_clients_per_tick
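
The shape of the cleanup is roughly the following. This is a sketch with stand-in types, not the code in cron/scheduler.py: `FakeAgent` is hypothetical, and the reaper is passed in as a callable because where `cleanup_stale_async_clients()` is imported from is not shown here.

```python
import logging
from typing import Callable

logger = logging.getLogger("cron.sketch")


class FakeAgent:
    """Stand-in for the ephemeral AIAgent built on each cron tick."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.closed = False

    def run(self) -> str:
        if self.fail:
            raise RuntimeError("job failed")
        return "ok"

    def close(self) -> None:
        self.closed = True


def run_job(agent: FakeAgent, cleanup_stale_async_clients: Callable[[], None]) -> str:
    try:
        return agent.run()
    finally:
        # Each step gets its own guard so that one failing cleanup neither
        # skips the other nor changes the job's outcome.
        try:
            agent.close()
        except Exception:
            logger.debug("agent.close() failed", exc_info=True)
        try:
            cleanup_stale_async_clients()
        except Exception:
            logger.debug("stale client cleanup failed", exc_info=True)
```

Because the cleanup sits in `finally`, it runs on both the success and the failure path, which is what the two regression tests above exercise.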

Combined with the existing per-turn cleanup in GatewayRunner._cleanup_agent_resources (earlier commit on this branch) and shutdown_cached_clients() on stop, auxiliary clients are now reaped everywhere they're created.

How to Verify

  1. rm $HERMES_HOME/gateway.pid on a dead gateway, then systemctl --user restart hermes-gateway.service — gateway should claim the pid file rather than looping on "PID file race lost". Same recovery applies when the prior process crashed without atexit.
  2. Leave the gateway running with a frequently-ticking cron job (e.g. every 10m). Track fd count with ls /proc/<pid>/fd | wc -l over multiple ticks — it should stay flat instead of growing monotonically.
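
For step 2, the same measurement can be scripted instead of running `ls` by hand. A minimal Linux-only sketch; the helper names are hypothetical and the pid is whatever the gateway is running as.

```python
import os


def count_open_fds(pid: int) -> int:
    """Count entries under /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_growth(samples: list[int]) -> int:
    """Net change in fd count between the first and last sample."""
    return samples[-1] - samples[0] if samples else 0
```

Sampling `count_open_fds(gateway_pid)` once per cron tick and checking that `fd_growth` stays near zero over several ticks is the scripted equivalent of the manual check.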

Test Plan

  • python -m pytest tests/gateway/test_status.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_runner_startup_failures.py -q — 43 passed
  • python -m pytest tests/cron/test_scheduler.py -q -o addopts="" — 90 passed (xdist-parallel flake in TestSilentDelivery / dingtalk / matrix is pre-existing on main)
  • Restored gateway on a machine that was stuck in the PID-race loop; confirmed it starts and a --replace cycle is clean.

Risk Assessment

Low.

  • The pid-cleanup change only runs after the caller has already classified the record as stale; the atexit safety check is preserved on remove_pid_file itself.
  • The cron cleanup mirrors the gateway's existing per-turn path and is wrapped in a broad try/except Exception, so a misbehaving close() cannot break an otherwise-successful job.

Subsumes #12998 (that PR's agent.close() in the cron scheduler is included here and extended with auxiliary-client reaping).

@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 22, 2026
Two issues were keeping the gateway from surviving long runs:

1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which
   refuses to unlink when the file's pid differs from our own. That
   safety check exists for the --replace atexit handoff, but it also
   applied to stale-record cleanup, so after a crashy exit the pid
   file was orphaned: `write_pid_file()`'s O_EXCL create then failed
   with `FileExistsError`, and systemd looped on "PID file race lost
   to another gateway instance". Unlink unconditionally from this
   helper since the caller has already verified the record is dead.

2. The cron scheduler never closed the ephemeral `AIAgent` it creates
   per tick, and never swept the process-global auxiliary-client
   cache. Over days of 10-minute ticks this leaked subprocesses and
   async httpx transports until the gateway hit EMFILE. Release the
   agent and call `cleanup_stale_async_clients()` in `run_job`'s
   outer `finally`, matching the gateway's own per-turn cleanup.
@bloodcarter changed the title from "fix: clean gateway auxiliary client caches on teardown" to "fix(gateway): recover from stale pid files and close cron agents" on Apr 23, 2026
teknium1 added a commit that referenced this pull request Apr 27, 2026
…alvage #13979) (#16598)

* fix: clean gateway auxiliary client caches on teardown

* fix(gateway): recover from stale pid files and close cron agents

* chore(release): map bloodcarter@gmail.com -> bloodcarter

---------

Co-authored-by: bloodcarter <bloodcarter@gmail.com>
@teknium1

Salvaged + merged as #16598. Thanks @bloodcarter — your cron and aux-client-cleanup fix is now on main. Closing this in favor of the salvaged branch which also resolves a couple of conflicts with the recently-extracted _kill_tool_subprocesses helper.

@teknium1 teknium1 closed this Apr 27, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…alvage NousResearch#13979) (NousResearch#16598)

donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
…alvage NousResearch#13979) (NousResearch#16598)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…alvage NousResearch#13979) (NousResearch#16598)

dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…alvage NousResearch#13979) (NousResearch#16598)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…alvage NousResearch#13979) (NousResearch#16598)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…alvage NousResearch#13979) (NousResearch#16598)
