fix: bound auxiliary client cache to prevent fd exhaustion in long-running gateways by teknium1 · Pull Request #10470 · NousResearch/hermes-agent

teknium1 · 2026-04-15T19:00:12Z

Summary

Fixes #10200. _client_cache in auxiliary_client.py grew without bound because the event loop's id() was part of the cache key: every new worker-thread event loop created a new entry for the same provider config. In long-running gateways where threads recycle frequently, this exhausted file descriptors after days of operation.

Root Cause

The cache key included loop_id = id(current_loop). When gateway worker threads create new event loops (via _run_async()/asyncio.run()), each loop gets a unique id(). Each cache entry also held a reference to its loop object, which prevented GC of the old loop and so guaranteed that a new loop could never reuse an old id(). Old entries with dead loops piled up, each holding an unclosed AsyncOpenAI client with its httpx connection pool (KQUEUE fds, Unix sockets, IPv4 fds).

Fix

Remove loop_id from cache key — the logical key is now (provider, async_mode, base_url, api_key, api_mode, runtime_key)
Validate loop at hit time — on async cache hits, check that the cached loop is the current, open loop. If the loop changed or was closed, force-close the stale client and replace the entry in-place
Add _CLIENT_CACHE_MAX_SIZE = 64 safety belt — FIFO eviction as defense-in-depth

This bounds cache growth to one entry per unique provider config rather than one per (config × event-loop). Cross-loop safety is preserved: different loops still get different client instances (validated by the existing TestCrossLoopCacheIsolation suite).

E2E Verification

Simulated 20 sequential worker threads with different event loops for the same provider:

Before: 20 cache entries (one per loop) → unbounded growth → fd exhaustion
After: 1 cache entry (replaced in-place) + 20 unique clients (cross-loop safe)
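A simulation of this shape might look like the sketch below. It is not the author's verification script: `_cache`, `get_client`, and `run_workers` are hypothetical, the cache is a miniature of the post-fix behavior (no eviction, no client closing), and `object()` stands in for a real client.

```python
import asyncio
import threading

_cache = {}  # key -> (client, loop); hypothetical miniature of the fixed cache


def get_client(key):
    loop = asyncio.get_running_loop()
    entry = _cache.get(key)
    # Replace the entry when it is missing, bound to another loop, or closed.
    if entry is None or entry[1] is not loop or entry[1].is_closed():
        _cache[key] = (object(), loop)
    return _cache[key][0]


def run_workers(n, key="example-provider"):
    """Run n sequential worker threads; return (cache entries, unique clients)."""
    _cache.clear()
    seen = []  # keeps every client alive so the identity count is reliable

    def worker():
        async def task():
            seen.append(get_client(key))

        asyncio.run(task())  # each thread gets its own fresh event loop

    for _ in range(n):
        t = threading.Thread(target=worker)
        t.start()
        t.join()
    return len(_cache), len({id(c) for c in seen})
```

With `n = 20` this returns one cache entry and 20 distinct clients, matching the "After" line.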

Test Results

14 targeted tests pass (9 in test_async_httpx_del_neuter.py + 5 in test_crossloop_client_cache.py)
3 new tests: TestClientCacheBoundedGrowth — stale loop replacement, no-growth verification, max-size eviction
1885 passing in broader agent/run_agent suite (9 pre-existing failures unrelated to this change)

…nning gateways (#10200) The _client_cache used event loop id() as part of the cache key, so every new worker-thread event loop created a new entry for the same provider config. In long-running gateways where threads are recycled frequently, this caused unbounded cache growth — each stale entry held an unclosed AsyncOpenAI client with its httpx connection pool, eventually exhausting file descriptors. Fix: remove loop_id from the cache key and instead validate on each async cache hit that the cached loop is the current, open loop. If the loop changed or was closed, the stale entry is replaced in-place rather than creating an additional entry. This bounds cache growth to at most one entry per unique provider config. Also adds a _CLIENT_CACHE_MAX_SIZE (64) safety belt with FIFO eviction as defense-in-depth against any remaining unbounded growth. Cross-loop safety is preserved: different event loops still get different client instances (validated by existing test suite). Closes #10200

teknium1 merged commit 6391b46 into main Apr 15, 2026
4 of 5 checks passed

teknium1 deleted the fix/client-cache-fd-exhaustion branch April 15, 2026 20:16

This was referenced Apr 27, 2026

fix: dedup TTL expiry, compression provider fallback, TCP keepalive CLOSE-WAIT #10405

Closed

fix(gateway,cron): close ephemeral agents + reap stale aux clients (salvage #13979) #16598

Merged
