fix(gateway): recover from stale pid files and close cron agents by bloodcarter · Pull Request #13979 · NousResearch/hermes-agent

bloodcarter · 2026-04-22T11:59:49Z

Bug Description

Two regressions were keeping the gateway from surviving long runs:

Stale PID file blocks restart. After a crashy exit (anything that skipped atexit), the next gateway run --replace exits with:
```
ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
```
systemd's Restart=on-failure loops forever because the stale pid file is never unlinked.
EMFILE after a couple of days. The original EMFILE fix closed ephemeral AIAgent instances in the gateway; the companion change, #13979 in its original scope, cleaned the process-global auxiliary async-client cache on gateway teardown and after each agent turn. Both miss the cron path: every 10-minute tick builds a fresh AIAgent plus async httpx clients in a new ThreadPoolExecutor worker loop, then drops everything on the floor when the pool shuts down. neuter_async_httpx_del disables __del__, so the transports are garbage-collected with their sockets still open.
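To illustrate why a leftover file is fatal in the first case: the commit message for this PR notes that `write_pid_file()` uses an O_EXCL create that fails with `FileExistsError`. The sketch below is a minimal stand-in for that behaviour, not the project's actual code; the function body and the plain-integer file format are assumptions.

```python
import os


def write_pid_file(path: str) -> None:
    """Hypothetical stand-in: create the pid file exclusively."""
    # O_CREAT | O_EXCL makes the open fail with FileExistsError when the
    # path already exists, whether or not the recorded process is still alive.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
```

With a pid file left behind by a crashed process, a second call on the same path raises `FileExistsError` until something unlinks the file.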

Root Cause

gateway/status.py::_cleanup_invalid_pid_path delegates to remove_pid_file(), whose "only unlink if the pid is mine" safety check (meant for the --replace atexit handoff where the new process may already have overwritten the file) also runs on stale-record cleanup. At the callsite in get_running_pid, we've already verified the record points at a dead process — the safety check is inapplicable and prevents recovery.
cron/scheduler.py::run_job never calls agent.close() and never reaps _client_cache. In a long-running gateway that ticks every N minutes, each run leaks subprocess handles (ProcessRegistry), the agent's OpenAI/httpx client, plus one set of async auxiliary httpx transports per cached-loop-miss.
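For the first root cause, the "record points at a dead process" check can be pictured with the usual POSIX signal-0 probe. This is a sketch under assumptions: `pid_is_alive` is a hypothetical name, and the real `get_running_pid` may verify liveness differently.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Hypothetical helper: does a process with this pid exist? (POSIX only)"""
    if pid <= 0:
        return False
    try:
        # Signal 0 runs the existence and permission checks without
        # delivering any signal to the target.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process, so the record is stale
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```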

Fix

Gateway PID recovery

_cleanup_invalid_pid_path unlinks unconditionally when cleanup_stale=True; the safety check stays on remove_pid_file for its actual purpose (atexit during handoff).
Regression test simulating a crashed-process record in tests/gateway/test_status.py::test_get_running_pid_cleans_stale_record_from_dead_process.
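A minimal sketch of the distinction, assuming a pid file that holds a plain integer (the real record format and function signatures in gateway/status.py may differ; both bodies below are illustrative only):

```python
import os
from pathlib import Path


def remove_pid_file(pid_path: Path) -> None:
    """Guarded removal: only unlink when the file still records our own pid."""
    try:
        recorded = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return
    if recorded == os.getpid():
        pid_path.unlink(missing_ok=True)


def cleanup_invalid_pid_path(pid_path: Path, cleanup_stale: bool) -> None:
    """Stale-record cleanup: the caller has already shown the owner is dead."""
    if cleanup_stale:
        # No ownership check here; a dead process's pid never equals ours,
        # so the guarded path would leave the file in place.
        pid_path.unlink(missing_ok=True)
```

Calling the guarded function on a file written by another pid leaves the file untouched, which is the behaviour that blocked recovery; the stale-cleanup helper removes it.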

Cron FD cleanup

Outer finally in run_job now:
- calls agent.close() (ports the cleanup from fix: close ephemeral agents to prevent fd leaks #12998 into this PR — subprocesses, terminal sandbox, browser daemon, OpenAI client)
- calls agent.auxiliary_client.cleanup_stale_async_clients() to drop cache entries whose worker-thread loop died with the ThreadPoolExecutor.
Regression tests in tests/cron/test_scheduler.py:
- test_run_job_closes_agent_on_failure_to_prevent_fd_leak
- test_run_job_reaps_stale_auxiliary_clients_per_tick

Combined with the existing per-turn cleanup in GatewayRunner._cleanup_agent_resources (earlier commit on this branch) and shutdown_cached_clients() on stop, auxiliary clients are now reaped everywhere they're created.
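The shape of the outer finally can be sketched roughly as follows. This is not the code in cron/scheduler.py: `make_agent`, `agent.run`, and the injected `reap_stale_clients` callable are hypothetical stand-ins for the real agent construction and for `cleanup_stale_async_clients()`.

```python
import logging

logger = logging.getLogger("cron.sketch")


def run_job(make_agent, prompt, reap_stale_clients):
    """Run one cron tick and always release what the tick created."""
    agent = None
    try:
        agent = make_agent()
        return agent.run(prompt)
    finally:
        # Each cleanup step has its own guard so a failure in one cannot
        # skip the other or replace the job's own result.
        if agent is not None:
            try:
                agent.close()
            except Exception:
                logger.debug("agent.close() failed", exc_info=True)
        try:
            reap_stale_clients()
        except Exception:
            logger.debug("stale client reaping failed", exc_info=True)
```

Because the cleanup lives in `finally`, it runs on both the success and failure paths, which is what the two regression tests above exercise.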

How to Verify

rm $HERMES_HOME/gateway.pid on a dead gateway, then systemctl --user restart hermes-gateway.service — gateway should claim the pid file rather than looping on "PID file race lost". Same recovery applies when the prior process crashed without atexit.
Leave the gateway running with a frequently-ticking cron job (e.g. every 10m). Track fd count with ls /proc/<pid>/fd | wc -l over multiple ticks — it should stay flat instead of growing monotonically.
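The fd tracking can also be scripted. The helper below is a small Linux-only sketch (function names are illustrative) that samples the same /proc/<pid>/fd directory as the ls command above:

```python
import os
import time


def count_open_fds(pid: int) -> int:
    """Count entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid: int, samples: int, interval_s: float) -> list[int]:
    """Take `samples` readings of the fd count, `interval_s` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i + 1 < samples:
            time.sleep(interval_s)
    return counts
```

Sampling at an interval longer than the cron period, for example `sample_fd_counts(gateway_pid, 6, 900.0)`, should give a roughly flat series after this change.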

Test Plan

python -m pytest tests/gateway/test_status.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_runner_startup_failures.py -q — 43 passed
python -m pytest tests/cron/test_scheduler.py -q -o addopts="" — 90 passed (xdist-parallel flake in TestSilentDelivery / dingtalk / matrix is pre-existing on main)
Restored gateway on a machine that was stuck in the PID-race loop; confirmed it starts and a --replace cycle is clean.

Risk Assessment

Low.

The pid-cleanup change only runs after the caller has already classified the record as stale; the atexit safety check is preserved on remove_pid_file itself.
The cron cleanup mirrors the gateway's existing per-turn path and is wrapped in a broad except Exception, so a misbehaving close() cannot break an otherwise-successful job.

Subsumes #12998 (that PR's agent.close() in the cron scheduler is included here and extended with auxiliary-client reaping).

Two issues were keeping the gateway from surviving long runs:

1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead.
2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup.

…alvage #13979) (#16598)

* fix: clean gateway auxiliary client caches on teardown
* fix(gateway): recover from stale pid files and close cron agents

  Two issues were keeping the gateway from surviving long runs:

  1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead.
  2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup.

* chore(release): map bloodcarter@gmail.com -> bloodcarter

---------

Co-authored-by: bloodcarter <bloodcarter@gmail.com>

teknium1 · 2026-04-27T14:42:25Z

Salvaged + merged as #16598. Thanks @bloodcarter — your cron and aux-client-cleanup fix is now on main. Closing this in favor of the salvaged branch which also resolves a couple of conflicts with the recently-extracted _kill_tool_subprocesses helper.


fix: clean gateway auxiliary client caches on teardown

82c2aaa

alt-glitch added labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) Apr 22, 2026

bloodcarter changed the title ~~fix: clean gateway auxiliary client caches on teardown~~ fix(gateway): recover from stale pid files and close cron agents Apr 23, 2026

bloodcarter mentioned this pull request Apr 23, 2026

fix: close ephemeral agents to prevent fd leaks #12998

Closed

teknium1 mentioned this pull request Apr 27, 2026

fix(gateway,cron): close ephemeral agents + reap stale aux clients (salvage #13979) #16598

Merged

teknium1 closed this Apr 27, 2026

bloodcarter wants to merge 2 commits into NousResearch:main from bloodcarter:fix/gateway-aux-client-fd-cleanup
