Skip to content

fix: close leaked cron and gateway resources#8331

Open
coldxiangyu163 wants to merge 2 commits into
NousResearch:mainfrom
coldxiangyu163:fix/fd-leaks-cron-gateway
Open

fix: close leaked cron and gateway resources#8331
coldxiangyu163 wants to merge 2 commits into
NousResearch:mainfrom
coldxiangyu163:fix/fd-leaks-cron-gateway

Conversation

@coldxiangyu163

Copy link
Copy Markdown

Bug Description

Hermes gateway could hit OSError: [Errno 24] Too many open files during long-lived cron/gateway operation on macOS launchd setups with a low RLIMIT_NOFILE (for example 256).

Observed symptoms included:

  • cron jobs still running but failing to deliver to Feishu
  • tools.terminal_tool, tools.code_execution_tool, and write_file failing with Too many open files
  • cron state persistence failing after delivery attempts
  • even opening ~/.hermes/.env failing inside later cron runs

Root Cause

There were two resource-lifecycle problems contributing to descriptor pressure inside the long-lived gateway process:

  1. tools/process_registry.py kept parent-side stdin/stdout/stderr pipe handles open after local background processes exited. Finished sessions remain in the registry for inspection, so those pipe FDs could accumulate until the process hit the OS limit.
  2. Several temporary AIAgent instances created in cron and gateway helper flows were never explicitly closed. That left provider/client resources and other agent-owned cleanup work dependent on process lifetime instead of request lifetime.

Fix

This PR tightens cleanup in the affected paths:

  • add explicit local pipe cleanup in ProcessRegistry._reader_loop() after the child process exits and has been reaped
  • close temporary AIAgent instances in:
    • cron.scheduler.run_job()
    • gateway pre-reset memory flush flow
    • gateway session hygiene compression flow
    • /background task execution flow
    • /btw task execution flow
    • /compress flow
  • add regression tests that assert cleanup happens for these paths

How to Verify

  1. Run the focused regression tests:
    source venv/bin/activate && python -m pytest -o addopts='' tests/tools/test_process_registry.py tests/cron/test_scheduler.py tests/gateway/test_background_command.py tests/gateway/test_compress_command.py tests/gateway/test_flush_memory_stale_guard.py -q
  2. Confirm the suite passes and the new assertions verify .close() / pipe cleanup behavior.
  3. In a long-lived gateway session, run cron/background workloads that previously reproduced fd exhaustion and confirm the process no longer accumulates stale pipe handles from completed local background processes.

Test Plan

  • Added regression test for this bug
  • Existing targeted tests still pass
  • Manual verification of the fix in a live long-running gateway session

Focused test command run for this PR:

source venv/bin/activate && python -m pytest -o addopts='' tests/tools/test_process_registry.py tests/cron/test_scheduler.py tests/gateway/test_background_command.py tests/gateway/test_compress_command.py tests/gateway/test_flush_memory_stale_guard.py -q

Result: 133 passed

Risk Assessment

Low / Medium — behavior changes are narrowly scoped to resource cleanup after helper-agent usage and after local background-process exit. The main risk is cleanup timing, but the change avoids closing pipes during kill_process() itself and keeps process-state transitions aligned with actual process exit.

- close temp AIAgent instances in cron and gateway helper flows
- close background process pipe handles after local process exit
- add regression coverage for resource cleanup paths

@iRonin iRonin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: excellent fix, 3 minor suggestions

1. Repeated try/finally: agent.close() pattern (DRY)

The same pattern appears 5 times in gateway/run.py:

try:
    return agent.run_conversation(...)
finally:
    try:
        agent.close()
    except Exception:
        pass

This should be a context manager or helper to avoid copy-paste drift. Something like:

@contextmanager
def transient_agent(*args, **kwargs):
    agent = AIAgent(*args, **kwargs)
    try:
        yield agent
    finally:
        try:
            agent.close()
        except Exception:
            pass

This would reduce 5 occurrences to 5 one-line invocations and make future changes (e.g., adding logging) a single edit.

2. _close_process_pipes dedup via is may miss subprocess.STDOUT cases

The dedup logic uses any(handle is existing for existing in handles) which catches cases where stderr and stdout are the same object, but subprocess.Popen(stderr=subprocess.STDOUT) creates a different situation — proc.stderr is None and stderr is redirected to stdout at the OS level. The current code would miss this and attempt to close the same fd twice (though close() on an already-closed fd is idempotent on most platforms, it can raise OSError on some). Consider checking getattr(proc, attr, None) is not None for each attr and then deduping by handle.fileno() instead of object identity.

3. cron/scheduler.py: agent = None sentinel is correct but worth documenting

The agent = None before the try block ensures the finally cleanup runs even if AIAgent(**kwargs) raises. This is correct but non-obvious — a one-line comment explaining why the sentinel exists would help future maintainers.

Overall

This is the more impactful PR of the two (#8073 and this one). The FD leak in _reader_loop is a real production issue for long-running gateways, and closing transient agents prevents the accumulation of httpx/OpenAI client connections. Recommend merging this one first.

iRonin added a commit to iRonin/hermes-agent-nous that referenced this pull request Apr 14, 2026
- Add _transient_agent() context manager in gateway/run.py for one-shot
  agent usage (memory flush, /btw, /background) — guarantees close()
- Add explicit try/finally: agent.close() for patterns that need
  post-construction setup (_print_fn, session_id access)
- Add _close_process_pipes() with fd-based dedup to handle
  stderr=subprocess.STDOUT correctly (object identity misses this case)
- Add agent = None sentinel + close() in cron/scheduler.py finally block
  so cleanup runs even if AIAgent(**kwargs) raises

Improves on upstream PR NousResearch#8331 by:
- DRYing 5 copy-paste try/finally blocks into a single context manager
- Fixing pipe dedup to use fileno() instead of object identity
- Adding explanatory comment for the sentinel pattern
@coldxiangyu163

Copy link
Copy Markdown
Author

Thanks — great suggestions.

I pushed a follow-up that addresses the two low-cost/high-signal items directly:

  1. I factored the repeated best-effort agent cleanup into a small _safe_close_agent(...) helper in gateway/run.py, so the close path is now centralized instead of duplicated across transient-agent call sites.

  2. I updated _close_process_pipes() to dedupe by fileno() rather than object identity, which is a better fit for redirected pipe cases.

  3. I also added an inline comment for the agent = None sentinel in cron/scheduler.py to clarify why it exists.

I kept this pass intentionally small and avoided turning it into a broader refactor/context-manager change, but I agree with the DRY direction if we want to take that further later.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cron Cron scheduler and job management labels Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants