fix(gateway,cron): close ephemeral agents + reap stale aux clients (salvage #13979) #16598

Merged: teknium1 merged 3 commits into main from salvage/cron-gateway-fd-cleanup-13979 on Apr 27, 2026

@teknium1

Contributor

Salvages #13979 (@bloodcarter) — closes #14209, #14210, #14368.

Problem

macOS gateway hits OSError: [Errno 24] Too many open files after ~4 days and degrades across Telegram, cron, .env loads, dynamic imports, and outbound LLM/httpx calls. Only a gateway restart recovers.

Two root causes, both missing the existing per-turn cleanup path:

  1. Cron scheduler leaks its ephemeral AIAgent per tick. cron/scheduler.py::run_job spawns a fresh AIAgent + AsyncOpenAI client inside a new ThreadPoolExecutor worker thread (new event loop). The finally block only restored TERMINAL_CWD and closed SessionDB — never called agent.close() and never reaped auxiliary_client._client_cache. Every tick leaks 1 KQUEUE fd, 2 unix socket fds (self-pipe pair), plus N httpx connection-pool fds. With macOS launchd's default RLIMIT_NOFILE=256, a gateway with 6 daily cron jobs hits EMFILE inside ~4 days — matching the reporter's timeline.

  2. Gateway _cleanup_agent_resources didn't reap the stale-loop cache. The process-global _client_cache FIFO-evicts at 64 entries (PR #10470, "fix: bound auxiliary client cache to prevent fd exhaustion in long-running gateways"), but entries bound to a worker-thread loop that died with its ThreadPoolExecutor sit there until shutdown. Per-turn cleanup needs to call cleanup_stale_async_clients() so dead-loop entries are reaped between turns.
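
The per-tick descriptor cost described above can be reproduced in isolation. The following is a minimal sketch, not the gateway's code: `count_open_fds` and `leak_demo` are hypothetical helpers, and the sketch assumes a POSIX system where an asyncio selector loop holds one selector descriptor (kqueue on macOS, epoll on Linux) plus a self-pipe socket pair. It keeps unclosed loops referenced, much as a cache entry bound to a dead worker-thread loop would, and compares descriptor counts before and after closing them.

```python
import asyncio
import os


def count_open_fds(limit: int = 1024) -> int:
    """Count open descriptors below `limit` by probing each with fstat (POSIX)."""
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count


def leak_demo(ticks: int = 5) -> tuple[int, int, int]:
    """Return fd counts: (baseline, with unclosed loops, after closing them)."""
    baseline = count_open_fds()
    # Each simulated tick creates a loop and keeps it referenced, so the
    # garbage collector cannot release its descriptors.
    loops = [asyncio.new_event_loop() for _ in range(ticks)]
    leaked = count_open_fds()
    for loop in loops:
        loop.close()  # releases the selector fd and the self-pipe socket pair
    return baseline, leaked, count_open_fds()


if __name__ == "__main__":
    baseline, leaked, closed = leak_demo()
    print(f"baseline={baseline} unclosed={leaked} after_close={closed}")
```

As a rough check rather than a measurement: with the figures in the description (a 256-descriptor limit and 6 job runs per day over roughly 4 days, so about 24 runs), the budget is at most about 10 descriptors per run before counting the gateway's own baseline. That is in line with 3 loop descriptors plus a handful of pooled httpx connections per run.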

Fix

Cron FD cleanup (cron/scheduler.py, +20): adds agent.close() and cleanup_stale_async_clients() in run_job's outer finally, mirroring the gateway's per-turn cleanup.
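
The shape of that `finally` block can be sketched as follows. This is an illustration under stated assumptions, not the code in cron/scheduler.py: `FakeAgent`, `reap_log`, and the stubbed `cleanup_stale_async_clients` are hypothetical stand-ins, and guarding each cleanup step separately is a design choice of the sketch rather than something the PR states.

```python
import logging

logger = logging.getLogger("cron.sketch")

# Records each reap call so the sketch can be inspected (hypothetical).
reap_log: list[str] = []


class FakeAgent:
    """Hypothetical stand-in for the ephemeral AIAgent created per tick."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.closed = False

    def run(self, prompt: str) -> str:
        if self.fail:
            raise RuntimeError("job failed")
        return f"ran: {prompt}"

    def close(self) -> None:
        self.closed = True


def cleanup_stale_async_clients() -> None:
    """Stub: the real helper reaps cache entries bound to dead event loops."""
    reap_log.append("reaped")


def run_job(agent: FakeAgent, prompt: str) -> str:
    try:
        return agent.run(prompt)
    finally:
        # Runs on success and on failure. Each step is guarded so that one
        # failing cleanup cannot skip the next.
        for step in (agent.close, cleanup_stale_async_clients):
            try:
                step()
            except Exception:
                logger.exception("cleanup step %s failed", step.__name__)
```

Because the cleanup sits in `finally`, a job that raises still releases its agent, which is the case the failure-path test in this PR targets.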

Gateway per-turn cleanup (gateway/run.py, +9): _cleanup_agent_resources now calls cleanup_stale_async_clients() after agent.close(). Final-cleanup block calls shutdown_cached_clients() once (moved out of the _kill_tool_subprocesses helper so the 2× invocation from the drain-timeout path doesn't matter).
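
To make the stale-loop reaping concrete, here is a minimal sketch of a bounded cache whose entries remember the event loop they were created on. It is not the project's auxiliary_client module: `FakeClient`, `cache_client`, and the cache layout are hypothetical, "dead loop" is assumed to mean `loop.is_closed()`, and the clients expose a synchronous `close()` purely to keep the example short. Only the 64-entry FIFO bound and the function name `cleanup_stale_async_clients` come from the description.

```python
import asyncio
from collections import OrderedDict

MAX_ENTRIES = 64  # FIFO bound mentioned in the description


class FakeClient:
    """Hypothetical stand-in for a cached async client."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


# key -> (client, event loop the client was created on)
_client_cache: "OrderedDict[str, tuple[FakeClient, asyncio.AbstractEventLoop]]" = OrderedDict()


def cache_client(key: str, client: FakeClient, loop: asyncio.AbstractEventLoop) -> None:
    _client_cache[key] = (client, loop)
    while len(_client_cache) > MAX_ENTRIES:
        # FIFO eviction: drop the oldest entry once the bound is exceeded.
        _, (evicted, _loop) = _client_cache.popitem(last=False)
        evicted.close()


def cleanup_stale_async_clients() -> int:
    """Close and drop entries whose loop is closed; return how many were reaped."""
    stale = [key for key, (_, loop) in _client_cache.items() if loop.is_closed()]
    for key in stale:
        client, _ = _client_cache.pop(key)
        client.close()
    return len(stale)
```

The size bound alone does not help here: a cache below 64 entries never evicts, so entries tied to dead loops keep their descriptors until something sweeps them explicitly.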

Tests

  • tests/cron/test_scheduler.py: success-path agent close assertion, test_run_job_closes_agent_on_failure_to_prevent_fd_leak, test_run_job_reaps_stale_auxiliary_clients_per_tick
  • tests/gateway/test_gateway_shutdown.py: test_cleanup_agent_resources_reaps_stale_aux_clients, shutdown_cached_clients asserted on stop
  • Targeted run (tests/cron/test_scheduler.py + tests/gateway/test_status.py + tests/gateway/test_gateway_shutdown.py): 136 passed locally (CI-parity via scripts/run_tests.sh).

Salvage notes vs original #13979

  • Rebased onto current origin/main (conflicts resolved; main had already extracted _kill_tool_subprocesses and landed the pid-path fix separately — PR's pid change is a no-op on current tree but left in place for intent parity).
  • Moved shutdown_cached_clients() out of the per-phase helper so the double-call from the drain-timeout path doesn't trip test assertions (now invoked once, right after _kill_tool_subprocesses("final-cleanup")).
  • Preserved bloodcarter authorship on both fix commits; added AUTHOR_MAP entry in a separate chore(release): commit.

Closes #14209
Closes #14210
Closes #14368

bloodcarter and others added 3 commits April 27, 2026 07:36
Two issues were keeping the gateway from surviving long runs:

1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which
   refuses to unlink when the file's pid differs from our own. That
   safety check exists for the --replace atexit handoff, but it also
   applied to stale-record cleanup, so after a crashy exit the pid
   file was orphaned: `write_pid_file()`'s O_EXCL create then failed
   with `FileExistsError`, and systemd looped on "PID file race lost
   to another gateway instance". Unlink unconditionally from this
   helper since the caller has already verified the record is dead.

2. The cron scheduler never closed the ephemeral `AIAgent` it creates
   per tick, and never swept the process-global auxiliary-client
   cache. Over days of 10-minute ticks this leaked subprocesses and
   async httpx transports until the gateway hit EMFILE. Release the
   agent and call `cleanup_stale_async_clients()` in `run_job`'s
   outer `finally`, matching the gateway's own per-turn cleanup.
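
A minimal sketch of the pid-file behaviour described in item 1, assuming a POSIX system. The function bodies are illustrative and are not the gateway's implementation: `pid_is_alive` is a hypothetical helper, and the liveness check is folded into the cleanup function here, whereas the commit message says the real caller verifies the record before the helper unlinks it.

```python
import os


def write_pid_file(path: str) -> None:
    """Create the pid file exclusively; raises FileExistsError if it exists."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def pid_is_alive(pid: int) -> bool:
    """Hypothetical liveness probe using signal 0 (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def cleanup_invalid_pid_path(path: str) -> bool:
    """Unlink a pid file whose record is dead or corrupt; return True if removed."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        pid = None  # corrupt record, treat as stale
    if pid is not None and pid_is_alive(pid):
        return False
    # No "is this our own pid?" check: the record is already known to be dead,
    # so refusing to unlink would leave the file orphaned.
    try:
        os.unlink(path)
    except FileNotFoundError:
        return False
    return True
```

With the stale file removed, the next exclusive create succeeds instead of failing with `FileExistsError`.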
@teknium1 teknium1 merged commit 9b55365 into main Apr 27, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the salvage/cron-gateway-fd-cleanup-13979 branch April 27, 2026 14:41
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…alvage NousResearch#13979) (NousResearch#16598)

* fix: clean gateway auxiliary client caches on teardown

* fix(gateway): recover from stale pid files and close cron agents

* chore(release): map bloodcarter@gmail.com -> bloodcarter

---------

Co-authored-by: bloodcarter <bloodcarter@gmail.com>