fix: close leaked cron and gateway resources#8331
Conversation
- close temp AIAgent instances in cron and gateway helper flows - close background process pipe handles after local process exit - add regression coverage for resource cleanup paths
iRonin
left a comment
There was a problem hiding this comment.
Review: excellent fix, 3 minor suggestions
1. Repeated try/finally: agent.close() pattern (DRY)
The same pattern appears 5 times in gateway/run.py:
try:
return agent.run_conversation(...)
finally:
try:
agent.close()
except Exception:
passThis should be a context manager or helper to avoid copy-paste drift. Something like:
@contextmanager
def transient_agent(*args, **kwargs):
agent = AIAgent(*args, **kwargs)
try:
yield agent
finally:
try:
agent.close()
except Exception:
passThis would reduce 5 occurrences to 5 one-line invocations and make future changes (e.g., adding logging) a single edit.
2. _close_process_pipes dedup via is may miss subprocess.STDOUT cases
The dedup logic uses any(handle is existing for existing in handles) which catches cases where stderr and stdout are the same object, but subprocess.Popen(stderr=subprocess.STDOUT) creates a different situation — proc.stderr is None and stderr is redirected to stdout at the OS level. The current code would miss this and attempt to close the same fd twice (though close() on an already-closed fd is idempotent on most platforms, it can raise OSError on some). Consider checking getattr(proc, attr, None) is not None for each attr and then deduping by handle.fileno() instead of object identity.
3. cron/scheduler.py: agent = None sentinel is correct but worth documenting
The agent = None before the try block ensures the finally cleanup runs even if AIAgent(**kwargs) raises. This is correct but non-obvious — a one-line comment explaining why the sentinel exists would help future maintainers.
Overall
This is the more impactful PR of the two (#8073 and this one). The FD leak in _reader_loop is a real production issue for long-running gateways, and closing transient agents prevents the accumulation of httpx/OpenAI client connections. Recommend merging this one first.
- Add _transient_agent() context manager in gateway/run.py for one-shot agent usage (memory flush, /btw, /background) — guarantees close() - Add explicit try/finally: agent.close() for patterns that need post-construction setup (_print_fn, session_id access) - Add _close_process_pipes() with fd-based dedup to handle stderr=subprocess.STDOUT correctly (object identity misses this case) - Add agent = None sentinel + close() in cron/scheduler.py finally block so cleanup runs even if AIAgent(**kwargs) raises Improves on upstream PR NousResearch#8331 by: - DRYing 5 copy-paste try/finally blocks into a single context manager - Fixing pipe dedup to use fileno() instead of object identity - Adding explanatory comment for the sentinel pattern
|
Thanks — great suggestions. I pushed a follow-up that addresses the two low-cost/high-signal items directly:
I kept this pass intentionally small and avoided turning it into a broader refactor/context-manager change, but I agree with the DRY direction if we want to take that further later. |
Bug Description
Hermes gateway could hit
OSError: [Errno 24] Too many open filesduring long-lived cron/gateway operation on macOS launchd setups with a lowRLIMIT_NOFILE(for example 256).Observed symptoms included:
tools.terminal_tool,tools.code_execution_tool, andwrite_filefailing withToo many open files~/.hermes/.envfailing inside later cron runsRoot Cause
There were two resource-lifecycle problems contributing to descriptor pressure inside the long-lived gateway process:
tools/process_registry.pykept parent-sidestdin/stdout/stderrpipe handles open after local background processes exited. Finished sessions remain in the registry for inspection, so those pipe FDs could accumulate until the process hit the OS limit.AIAgentinstances created in cron and gateway helper flows were never explicitly closed. That left provider/client resources and other agent-owned cleanup work dependent on process lifetime instead of request lifetime.Fix
This PR tightens cleanup in the affected paths:
ProcessRegistry._reader_loop()after the child process exits and has been reapedAIAgentinstances in:cron.scheduler.run_job()/backgroundtask execution flow/btwtask execution flow/compressflowHow to Verify
source venv/bin/activate && python -m pytest -o addopts='' tests/tools/test_process_registry.py tests/cron/test_scheduler.py tests/gateway/test_background_command.py tests/gateway/test_compress_command.py tests/gateway/test_flush_memory_stale_guard.py -q.close()/ pipe cleanup behavior.Test Plan
Focused test command run for this PR:
source venv/bin/activate && python -m pytest -o addopts='' tests/tools/test_process_registry.py tests/cron/test_scheduler.py tests/gateway/test_background_command.py tests/gateway/test_compress_command.py tests/gateway/test_flush_memory_stale_guard.py -qResult:
133 passedRisk Assessment
Low / Medium — behavior changes are narrowly scoped to resource cleanup after helper-agent usage and after local background-process exit. The main risk is cleanup timing, but the change avoids closing pipes during
kill_process()itself and keeps process-state transitions aligned with actual process exit.