Bug: auxiliary_client._client_cache accumulates AsyncOpenAI clients indefinitely, causing fd exhaustion in long-running gateway #10200

@morningto

Summary

auxiliary_client._client_cache (line ~1685) accumulates AsyncOpenAI client entries indefinitely in long-running gateway processes. Each cached async client holds an httpx.AsyncClient bound to a specific event loop, which keeps that loop's kqueue selector and self-pipe Unix sockets alive. Over days of operation, this exhausts the process's file descriptor limit.

This is a deeper root cause than #8043, which addressed model_tools.py event loop cleanup but did not address the _client_cache accumulation.

Root Cause

The cache key for async clients includes id(asyncio.get_event_loop()) (line ~1817):

cache_key = (provider, async_mode, base_url, api_key, loop_id)

When ThreadPoolExecutor recycles worker threads (e.g., during cron job execution or gateway message handling), each new thread gets a new event loop with a new loop_id. Each unique loop_id creates a new cache entry with a new AsyncOpenAI client. Old entries are never evicted: cleanup_stale_async_clients() (line ~1769) only removes entries whose loop reports .is_closed(), but nothing closes the loop of an exited thread, and the cached client holds a reference to that loop, so the loop is never garbage-collected and its file descriptors stay open.

Reproduction

import asyncio, concurrent.futures, gc, os, subprocess
from agent.auxiliary_client import _client_cache, _client_cache_lock, _get_cached_client

pid = os.getpid()

def fd_count():
    """Return (lsof entry count, KQUEUE entry count) for this process (macOS)."""
    r = subprocess.run(["lsof", "-p", str(pid)], capture_output=True, text=True)
    lines = r.stdout.strip().split("\n")[1:]  # skip the lsof header row
    kq = sum(1 for line in lines if "KQUEUE" in line)
    return len(lines), kq

def run():
    # Each worker thread creates its own event loop and never closes it.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.sleep(0))
    _get_cached_client("custom", "test", async_mode=True,
                       base_url="https://example.com/v1", api_key="key")

total, kq = fd_count()
print(f"Before: total={total}, KQUEUE={kq}")

for i in range(10):
    # A fresh pool per iteration forces a new thread, hence a new loop_id.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    pool.submit(run).result()
    pool.shutdown(wait=True)

gc.collect()
total, kq = fd_count()
print(f"After: total={total}, KQUEUE={kq}")
# Result: 10 KQUEUE fds leaked (1 per unique loop_id)

with _client_cache_lock:
    print(f"Cache entries: {len(_client_cache)}")  # 10 entries, never cleaned

Observed Impact

On a macOS gateway running for ~4 days with 6 daily cron jobs + interactive chat:

  • 56 KQUEUE fds (one per leaked event loop)
  • 113 unix socket fds (self-pipe pairs, 2 per loop)
  • 67 IPv4 fds (httpx connection pools)
  • Total: 323 fds, exceeding the macOS `launchctl limit maxfiles` soft limit of 256
  • All cron deliveries and new connections failed with [Errno 24] Too many open files
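To catch this before [Errno 24] appears, a process can compare its open descriptor count against the soft limit. The sketch below is one possible way to do that on macOS or Linux; `fd_usage` and `near_fd_limit` are hypothetical helper names, and the 0.8 threshold is an arbitrary illustration, not a value taken from the gateway.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /dev/fd lists this process's descriptors on macOS and Linux. The
    # listing itself briefly opens one descriptor, so the count can be
    # one higher than the steady-state value.
    open_fds = len(os.listdir("/dev/fd"))
    return open_fds, soft


def near_fd_limit(threshold=0.8):
    """Return True when open fds reach the given fraction of the soft limit."""
    open_fds, soft = fd_usage()
    if soft == resource.RLIM_INFINITY:
        return False
    return open_fds >= threshold * soft


open_fds, soft = fd_usage()
print(f"open fds: {open_fds}, soft limit: {soft}, near limit: {near_fd_limit()}")
```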

Suggested Fix

  1. LRU/TTL eviction for _client_cache: Cap the number of async client cache entries (e.g., 16) and close evicted clients explicitly.
  2. Thread-aware cleanup: In cleanup_stale_async_clients(), also check whether the thread that created each cached loop is still alive. If the thread is dead, close the client and remove the entry.
  3. Periodic cleanup in gateway: Call cleanup_stale_async_clients() periodically from the cron ticker or a dedicated cleanup task, not just after agent turns.
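Items 1 and 2 could be combined roughly as follows. This is a sketch under stated assumptions, not the existing implementation: `AsyncClientCache` and `FakeClient` are hypothetical names, the cached client is assumed to expose an awaitable `close()`, and the cap of 16 is the example value from item 1.

```python
import asyncio
import threading
from collections import OrderedDict


class AsyncClientCache:
    """Hypothetical bounded cache; entries are (client, loop, owner_thread)."""

    def __init__(self, max_entries=16):
        self._max = max_entries
        self._lock = threading.Lock()
        self._entries = OrderedDict()

    def get_or_create(self, key, factory):
        """Return the cached client for key, creating it via factory() if absent."""
        evicted = []
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as most recently used
                return self._entries[key][0]
            client = factory()
            self._entries[key] = (client, asyncio.get_event_loop(),
                                  threading.current_thread())
            while len(self._entries) > self._max:
                evicted.append(self._entries.popitem(last=False)[1])
        for entry in evicted:
            self._release(*entry)
        return client

    def cleanup(self):
        """Drop entries whose loop is closed or whose owner thread has exited."""
        with self._lock:
            stale = [k for k, (_, loop, owner) in self._entries.items()
                     if loop.is_closed() or not owner.is_alive()]
            removed = [self._entries.pop(k) for k in stale]
        for entry in removed:
            self._release(*entry)
        return len(removed)

    def __len__(self):
        with self._lock:
            return len(self._entries)

    @staticmethod
    def _release(client, loop, owner):
        if loop.is_closed():
            return
        if owner.is_alive():
            # The owner may still use its loop: schedule the close there
            # and leave the loop itself alone.
            asyncio.run_coroutine_threadsafe(client.close(), loop)
            return
        # The owner is gone, so nothing else will run this loop. This must
        # be called from a thread that is not itself running an event loop.
        loop.run_until_complete(client.close())
        loop.close()


class FakeClient:
    """Stand-in for an async client with an awaitable close()."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


cache = AsyncClientCache(max_entries=16)
clients = []


def worker():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    clients.append(cache.get_or_create(("custom", id(loop)), FakeClient))


for _ in range(5):
    t = threading.Thread(target=worker)
    t.start()
    t.join()

removed = cache.cleanup()
print(removed, len(cache), all(c.closed for c in clients))  # 5 0 True
```

For item 3, `cleanup()` would need to run from a plain thread rather than inside a running event loop, for example via `await asyncio.to_thread(cache.cleanup)` from an async ticker. Scheduling the close on a live owner's loop only takes effect the next time that loop runs, so a real fix may need a stricter policy for evicting entries whose owner is still alive.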

Environment

Related
