fix(gateway): recover from stale pid files and close cron agents#13979
Closed
bloodcarter wants to merge 2 commits into
Closed
fix(gateway): recover from stale pid files and close cron agents#13979bloodcarter wants to merge 2 commits into
bloodcarter wants to merge 2 commits into
Conversation
Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup.
teknium1
added a commit
that referenced
this pull request
Apr 27, 2026
…alvage #13979) (#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
Contributor
|
Salvaged + merged as #16598. Thanks @bloodcarter — your cron and aux-client-cleanup fix is now on main. Closing this in favor of the salvaged branch which also resolves a couple of conflicts with the recently-extracted |
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
donald131
pushed a commit
to donald131/hermes-agent
that referenced
this pull request
May 2, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
dannyJ848
pushed a commit
to dannyJ848/hermes-agent
that referenced
this pull request
May 17, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…alvage NousResearch#13979) (NousResearch#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug Description
Two regressions were keeping the gateway from surviving long runs:
Stale PID file blocks restart. After a crashy exit (anything that skipped
atexit), the nextgateway run --replaceexits with:systemd's
Restart=on-failureloops forever because the stale pid file is never unlinked.EMFILE after a couple of days. The original EMFILE fix closed ephemeral
AIAgentinstances in the gateway; the companion PR fix(gateway): recover from stale pid files and close cron agents #13979 (original scope) cleaned the process-global auxiliary async-client cache on gateway teardown / per agent turn. Both miss the cron path: every 10-minute tick builds a freshAIAgent+ async httpx clients in a newThreadPoolExecutorworker loop, and then drops everything on the floor when the pool shuts down.neuter_async_httpx_deldisables__del__, so the transports GC with their sockets still open.Root Cause
gateway/status.py::_cleanup_invalid_pid_pathdelegates toremove_pid_file(), whose "only unlink if the pid is mine" safety check (meant for the--replaceatexit handoff where the new process may already have overwritten the file) also runs on stale-record cleanup. At the callsite inget_running_pid, we've already verified the record points at a dead process — the safety check is inapplicable and prevents recovery.cron/scheduler.py::run_jobnever callsagent.close()and never reaps_client_cache. In a long-running gateway that ticks every N minutes, each run leaks subprocess handles (ProcessRegistry), the agent's OpenAI/httpx client, plus one set of async auxiliary httpx transports per cached-loop-miss.Fix
Gateway PID recovery
_cleanup_invalid_pid_pathunlinks unconditionally whencleanup_stale=True; the safety check stays onremove_pid_filefor its actual purpose (atexit during handoff).tests/gateway/test_status.py::test_get_running_pid_cleans_stale_record_from_dead_process.Cron FD cleanup
finallyinrun_jobnow:agent.close()(ports the cleanup from fix: close ephemeral agents to prevent fd leaks #12998 into this PR — subprocesses, terminal sandbox, browser daemon, OpenAI client)agent.auxiliary_client.cleanup_stale_async_clients()to drop cache entries whose worker-thread loop died with theThreadPoolExecutor.tests/cron/test_scheduler.py:test_run_job_closes_agent_on_failure_to_prevent_fd_leaktest_run_job_reaps_stale_auxiliary_clients_per_tickCombined with the existing per-turn cleanup in
GatewayRunner._cleanup_agent_resources(earlier commit on this branch) andshutdown_cached_clients()on stop, auxiliary clients are now reaped everywhere they're created.How to Verify
rm $HERMES_HOME/gateway.pidon a dead gateway, thensystemctl --user restart hermes-gateway.service— gateway should claim the pid file rather than looping on "PID file race lost". Same recovery applies when the prior process crashed without atexit.ls /proc/<pid>/fd | wc -lover multiple ticks — it should stay flat instead of growing monotonically.Test Plan
python -m pytest tests/gateway/test_status.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_runner_startup_failures.py -q— 43 passedpython -m pytest tests/cron/test_scheduler.py -q -o addopts=""— 90 passed (xdist-parallel flake inTestSilentDelivery/ dingtalk / matrix is pre-existing onmain)--replacecycle is clean.Risk Assessment
Low.
remove_pid_fileitself.except Exceptionso a misbehavingclose()cannot break an otherwise-successful job.Subsumes #12998 (that PR's
agent.close()in the cron scheduler is included here and extended with auxiliary-client reaping).