Gateway run.py: PID file race condition and httpx connection leak on cache eviction #14598

@stormhierta

Summary

Two related reliability issues in gateway/run.py can cause the gateway to fail to start or leak httpx connections during cache churn.


Issue 1: PID file race condition causes spurious gateway exit

Location: start_gateway() around line 11144

Problem: If the previous gateway was killed ungracefully (SIGTERM/SIGKILL) before its atexit handler could remove the PID file, the stale file stays on disk. Any gateway processes that then try to start (e.g., after a crash or during auto-restart, possibly two at once) hit FileExistsError on the PID file and exit immediately, even though no gateway is actually running.

Current behavior:

except FileExistsError:
    release_gateway_runtime_lock()
    logger.error("PID file race lost to another gateway instance. Exiting.")
    return False

Fix: On FileExistsError, check whether a real gateway process is still alive. If the PID file is stale (previous gateway crashed), remove it and retry once. Only exit if a live gateway is confirmed.
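
The stale-PID check described above could look roughly like the following. This is a minimal sketch, not the actual patch: `pid_is_alive` and `claim_pid_file` are hypothetical names, the PID file is assumed to contain a single integer, and the liveness probe is POSIX-only. It only checks that some process with that PID exists, so confirming that the process is really a gateway (for example by inspecting its command line) would need an additional step.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_pid_file(path: str, retries: int = 1) -> bool:
    """Hypothetical helper: create the PID file, clearing a stale one and retrying."""
    for _ in range(retries + 1):
        try:
            # O_EXCL makes creation atomic: exactly one process can win.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(path) as f:
                    existing = int(f.read().strip() or 0)
            except (OSError, ValueError):
                existing = 0  # unreadable or garbled file: treat as stale
            if pid_is_alive(existing):
                return False  # a live process owns the file; caller should exit
            try:
                os.unlink(path)  # stale file left by a crashed process
            except FileNotFoundError:
                pass  # another starter already removed it
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False
```

In start_gateway(), the existing except FileExistsError branch would then release the runtime lock and return False only when a helper like this reports a live owner. Note that an empty PID file is treated as stale here, which can misfire if another starter has created the file but not yet written its PID.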


Issue 2: httpx connection leak during agent cache eviction

Location: _evict_cached_agent() around line 8810

Problem: When _evict_cached_agent() removes an AIAgent from the cache (on /new, /model switch, etc.), it only pops the entry and leaves the object to the garbage collector. However, the AIAgent holds httpx AsyncClient instances for LLM providers. These are never closed explicitly, so their connection pools leak during cache churn.

Current behavior:

def _evict_cached_agent(self, session_key: str) -> None:
    _lock = getattr(self, "_agent_cache_lock", None)
    if _lock:
        with _lock:
            self._agent_cache.pop(session_key, None)

Fix: After popping the agent from the cache, call _release_evicted_agent_soft(agent) on a daemon thread to cleanly release httpx client resources before GC. _release_evicted_agent_soft already exists and calls agent.release_clients().
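
A minimal sketch of that eviction path follows, under several assumptions: `GatewayCacheSketch` is a hypothetical stand-in for the gateway class, the cache is assumed to map session keys directly to agent objects, and the body of `_release_evicted_agent_soft` shown here is a placeholder for the existing helper (only its call to agent.release_clients() is taken from this issue). Returning the thread is an addition for testability; the original method returns None.

```python
import threading


class GatewayCacheSketch:
    """Hypothetical stand-in for the gateway object that owns the agent cache."""

    def __init__(self):
        self._agent_cache = {}
        self._agent_cache_lock = threading.Lock()

    def _release_evicted_agent_soft(self, agent):
        # Placeholder for the existing helper: best-effort client release.
        try:
            agent.release_clients()
        except Exception:
            pass  # cleanup must never break the eviction path

    def _evict_cached_agent(self, session_key: str):
        # Pop under the lock, but run the cleanup outside it so a slow
        # release cannot block other cache users.
        with self._agent_cache_lock:
            agent = self._agent_cache.pop(session_key, None)
        if agent is None:
            return None
        worker = threading.Thread(
            target=self._release_evicted_agent_soft,
            args=(agent,),
            daemon=True,  # do not keep the process alive for cleanup
        )
        worker.start()
        return worker
```

Running the release on a daemon thread keeps eviction non-blocking for the caller, at the cost that a release still in flight at interpreter shutdown may not complete.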


Related: Discord token failure causes reconnect storms

Location: gateway/platforms/discord.py

The Discord adapter repeatedly attempts to reconnect with an invalid/expired token (LoginFailure: Improper token has been passed), producing aiohttp "Unclosed client session" errors and filling the logs. This is a separate but contributing issue: the reconnect loop can interfere with gateway stability during restarts. Fixing the Discord token will stop the reconnect storms.
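
Separately from fixing the token, a reconnect loop can avoid this kind of storm by treating an authentication failure as non-retryable. The sketch below is an illustration only and is not one of the fixes applied in this issue: `run_with_reconnect` is a hypothetical helper, and the local `LoginFailure` class is a stand-in for discord.LoginFailure so the example runs without discord.py installed.

```python
import asyncio


class LoginFailure(Exception):
    """Stand-in for discord.LoginFailure (assumption, keeps the sketch self-contained)."""


async def run_with_reconnect(connect, max_attempts=5, base_delay=1.0,
                             sleep=asyncio.sleep):
    """Call connect() until it succeeds, backing off on transient errors.

    Returns True on success, False if the token is rejected or the
    attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            await connect()
            return True
        except LoginFailure:
            # A rejected token will be rejected again; retrying only adds noise.
            return False
        except OSError:
            # Network-level failure (includes ConnectionError): back off, retry.
            await sleep(min(base_delay * 2 ** attempt, 60.0))
    return False
```

With this shape, an invalid token results in a single failed attempt instead of a loop, while ordinary network failures still get exponential backoff capped at 60 seconds.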


Status

Fixes have been applied to the running instance. The PID file race fix is confirmed working — gateway successfully restarted after SIGKILL of the previous instance.

Files affected: gateway/run.py, gateway/platforms/discord.py

Labels

P1 (High: major feature broken, no workaround)
comp/gateway (Gateway runner, session dispatch, delivery)
type/bug (Something isn't working)
