Summary
auxiliary_client._client_cache (line ~1685) accumulates AsyncOpenAI client entries indefinitely in long-running gateway processes. Each cached async client holds an httpx.AsyncClient bound to a specific event loop, which keeps the loop's kqueue selector and self-pipe unix sockets alive. Over days of operation, this exhausts the process file descriptor limit.
This is a deeper root cause than #8043, which addressed model_tools.py event loop cleanup but did not address the _client_cache accumulation.
Root Cause
The cache key for async clients includes id(asyncio.get_event_loop()) (line ~1817):
cache_key = (provider, async_mode, base_url, api_key, loop_id)
When worker threads are recycled by ThreadPoolExecutor (e.g., during cron job execution or gateway message handling), new threads get new event loops with new loop_id values. Each unique loop_id creates a new cache entry with a new AsyncOpenAI client. Old entries are never evicted: cleanup_stale_async_clients() (line ~1769) only removes entries whose loop .is_closed(), but the cached client itself holds a reference to the loop, so the loop is never garbage-collected and therefore never closed.
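The growth pattern can be illustrated without the real module. The following is a minimal sketch, assuming a plain dict stands in for _client_cache and a small dict stands in for the AsyncOpenAI client; get_client, worker, and leak_demo are hypothetical names used only for illustration.

```python
import asyncio
import concurrent.futures


def get_client(cache, provider):
    """Stand-in for the cached lookup: the key includes id() of the thread's loop."""
    loop = asyncio.get_event_loop()
    key = (provider, id(loop))
    if key not in cache:
        # A real async client would hold the loop indirectly; a dict stands in here.
        cache[key] = {"loop": loop}
    return cache[key]


def worker(cache):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.sleep(0))
    get_client(cache, "custom")


def leak_demo(n):
    cache = {}
    for _ in range(n):
        # A fresh executor per iteration forces a fresh thread, hence a fresh loop.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            pool.submit(worker, cache).result()
    return cache


cache = leak_demo(5)
closed_count = sum(entry["loop"].is_closed() for entry in cache.values())
print(f"entries={len(cache)}, closed loops={closed_count}")

# Tidy up the demo's own loops so the example does not leak descriptors itself.
for entry in cache.values():
    entry["loop"].close()
```

Because every cached value keeps its loop alive, each thread contributes a distinct key and none of the loops ever reports is_closed(), so a cleanup pass that only looks for closed loops finds nothing to remove.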
Reproduction
import asyncio, concurrent.futures, gc, os, subprocess
from agent.auxiliary_client import _client_cache, _client_cache_lock, _get_cached_client

pid = os.getpid()

def fd_count():
    # Count this process's open fds via lsof; KQUEUE rows are event-loop selectors (macOS).
    r = subprocess.run(["lsof", "-p", str(pid)], capture_output=True, text=True)
    lines = r.stdout.strip().split("\n")[1:]  # skip the lsof header row
    kq = sum(1 for line in lines if "KQUEUE" in line)
    return len(lines), kq

total, kq = fd_count()
print(f"Before: total={total}, KQUEUE={kq}")

for i in range(10):
    # A new executor per iteration forces a new worker thread and a new event loop.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    def run():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(asyncio.sleep(0))
        _get_cached_client("custom", "test", async_mode=True,
                           base_url="https://example.com/v1", api_key="key")
    pool.submit(run).result()
    pool.shutdown(wait=True)

gc.collect()
total, kq = fd_count()
print(f"After: total={total}, KQUEUE={kq}")
# Result: 10 KQUEUE fds leaked (1 per unique loop_id)

with _client_cache_lock:
    print(f"Cache entries: {len(_client_cache)}")  # 10 entries, never cleaned
Observed Impact
On a macOS gateway running for ~4 days with 6 daily cron jobs + interactive chat:
- 56 KQUEUE fds (one per leaked event loop)
- 113 unix socket fds (self-pipe pairs, 2 per loop)
- 67 IPv4 fds (httpx connection pools)
- Total: 323 fds, exceeding the macOS launchctl limit maxfiles soft limit of 256
- All cron deliveries and new connections failed with [Errno 24] Too many open files
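To watch for this condition before it reaches [Errno 24], a process can compare its open descriptor count with its soft limit. This is a minimal sketch, assuming a Unix host where /dev/fd lists the current process's descriptors (true on macOS and Linux); fd_headroom and fd_usage_ratio are hypothetical helper names, not part of the gateway.

```python
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /dev/fd lists this process's descriptors on macOS and Linux; the listing
    # itself briefly opens one descriptor, so the count is approximate.
    open_fds = len(os.listdir("/dev/fd"))
    return open_fds, soft


def fd_usage_ratio():
    """Fraction of the soft limit in use, or None when the limit is unlimited."""
    open_fds, soft = fd_headroom()
    if soft == resource.RLIM_INFINITY or soft <= 0:
        return None
    return open_fds / soft


open_fds, soft = fd_headroom()
print(f"open fds: {open_fds}, soft limit: {soft}")
```

Logging this ratio from a long-running process would make a slow climb toward the limit visible days before connections start failing.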
Suggested Fix
- LRU/TTL eviction for _client_cache: Cap the number of async client cache entries (e.g., 16) and close evicted clients explicitly.
- Thread-aware cleanup: In cleanup_stale_async_clients(), also check whether the thread that created each cached loop is still alive. If the thread is dead, close the client and remove the entry.
- Periodic cleanup in gateway: Call cleanup_stale_async_clients() periodically from the cron ticker or a dedicated cleanup task, not just after agent turns.
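The first two suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: BoundedClientCache and FakeClient are hypothetical names, the real client is assumed to expose an async close(), and each entry records the thread that created it so a later pass can tell whether that thread has exited.

```python
import asyncio
import threading
from collections import OrderedDict


class BoundedClientCache:
    """Hypothetical LRU cache for loop-bound clients (not the project's API)."""

    def __init__(self, max_entries=16):
        self._entries = OrderedDict()  # key -> (client, loop, owner thread)
        self._lock = threading.Lock()
        self._max = max_entries

    def __len__(self):
        with self._lock:
            return len(self._entries)

    def get_or_create(self, key, loop, factory):
        evicted = []
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as most recently used
                return self._entries[key][0]
            client = factory()
            self._entries[key] = (client, loop, threading.current_thread())
            while len(self._entries) > self._max:
                evicted.append(self._entries.popitem(last=False)[1])
        for entry in evicted:  # dispose outside the lock
            self._dispose(entry)
        return client

    def prune_dead_threads(self):
        """Drop entries whose creating thread has exited; return how many."""
        with self._lock:
            dead = [k for k, (_, _, t) in self._entries.items() if not t.is_alive()]
            entries = [self._entries.pop(k) for k in dead]
        for entry in entries:
            self._dispose(entry)
        return len(entries)

    @staticmethod
    def _dispose(entry):
        client, loop, owner = entry
        if loop.is_closed():
            return
        if owner.is_alive():
            # The owner may still run this loop, so queue the close on it.
            loop.call_soon_threadsafe(lambda: loop.create_task(client.close()))
        else:
            # Nothing else will run this loop again. Must be called from a
            # thread that is not itself running an event loop.
            loop.run_until_complete(client.close())
            loop.close()


class FakeClient:
    """Test double; assumes the real client exposes an async close()."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


cache = BoundedClientCache(max_entries=2)
created = []


def worker():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    created.append((cache.get_or_create(("custom", id(loop)), loop, FakeClient), loop))


for _ in range(3):
    t = threading.Thread(target=worker)
    t.start()
    t.join()

size_after_inserts = len(cache)      # capped at 2; the oldest entry was evicted
pruned = cache.prune_dead_threads()  # all worker threads have exited by now
print(size_after_inserts, pruned, all(c.closed for c, _ in created))
```

Closing the loop together with the client is what releases the selector and self-pipe descriptors; removing the cache entry alone would only drop the reference. The third suggestion would then amount to calling something like prune_dead_threads() from the cron ticker, on a thread that is not running an event loop of its own.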
Environment
Related
- model_tools.py event loop cleanup (partial fix, does not cover _client_cache)
- _worker_thread_local cleanup was missing