Summary
Two related reliability issues in gateway/run.py can cause the gateway to fail to start or leak httpx connections during cache churn.
Issue 1: PID file race condition causes spurious gateway exit
Location: start_gateway() around line 11144
Problem: When a gateway starts after a crash or during auto-restart (including the case where two processes start at the same time), each start hits FileExistsError on the PID file and exits immediately, even though no gateway is actually running. The cause is a stale PID file: the previous gateway was killed ungracefully (SIGTERM/SIGKILL) before its atexit handler could remove it.
Current behavior:
except FileExistsError:
    release_gateway_runtime_lock()
    logger.error("PID file race lost to another gateway instance. Exiting.")
    return False
Fix: On FileExistsError, check whether a real gateway process is still alive. If the PID file is stale (previous gateway crashed), remove it and retry once. Only exit if a live gateway is confirmed.
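A minimal sketch of the stale-PID check described above, not the actual patch. The names `claim_pid_file` and `pid_is_alive` are hypothetical, the code assumes a POSIX host, and it assumes the PID file contains only the owner's PID as text. A bare liveness check cannot tell a gateway apart from an unrelated process that reused the PID, so the real fix may need a stronger test (for example, inspecting the process command line). The real code path would presumably still call release_gateway_runtime_lock() before exiting.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_pid_file(pid_path: Path) -> bool:
    """Create the PID file exclusively; clear a stale file and retry once."""
    for attempt in range(2):
        try:
            fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                owner = int(pid_path.read_text().strip())
            except (ValueError, FileNotFoundError):
                owner = -1  # unreadable or vanished file is treated as stale
            # A file naming our own PID cannot belong to another live gateway.
            if owner != os.getpid() and pid_is_alive(owner):
                return False  # a live process owns the file
            if attempt == 1:
                return False  # already retried once; give up
            try:
                pid_path.unlink()  # stale file from a crashed gateway
            except FileNotFoundError:
                pass  # another starter removed it first
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return True
    return False
```

The exclusive create (`O_CREAT | O_EXCL`) keeps the claim atomic, so if two starters both remove the stale file, only one of them wins the retry. A starter that reads the file in the instant between another starter's create and write would see it as empty and treat it as stale; holding the gateway runtime lock around this function would close that window.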
Issue 2: httpx connection leak during agent cache eviction
Location: _evict_cached_agent() around line 8810
Problem: When _evict_cached_agent() removes an AIAgent from the cache (on /new, /model switch, etc.), it simply pops the entry and lets the object be garbage collected. However, the AIAgent holds httpx AsyncClient connections for LLM providers. These are not closed before GC, causing connection pool leaks during cache churn.
Current behavior:
def _evict_cached_agent(self, session_key: str) -> None:
    _lock = getattr(self, "_agent_cache_lock", None)
    if _lock:
        with _lock:
            self._agent_cache.pop(session_key, None)
Fix: After popping the agent from the cache, call _release_evicted_agent_soft(agent) on a daemon thread to cleanly release httpx client resources before GC. _release_evicted_agent_soft already exists and calls agent.release_clients().
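The eviction change could look roughly like the sketch below. It is an illustration under assumptions, not the applied patch: `GatewayRunnerSketch` is a hypothetical stand-in class, the cache is assumed to map session keys directly to agent objects, and `_release_evicted_agent_soft` is reimplemented here only so the example runs (the report states that the real helper already exists and calls agent.release_clients()).

```python
import logging
import threading
from typing import Any, Dict

logger = logging.getLogger(__name__)


class GatewayRunnerSketch:
    """Hypothetical stand-in for the class that owns the agent cache."""

    def __init__(self) -> None:
        self._agent_cache: Dict[str, Any] = {}
        self._agent_cache_lock = threading.Lock()

    def _release_evicted_agent_soft(self, agent: Any) -> None:
        # Stand-in for the existing helper; swallows errors so a failed
        # release never takes down the cleanup thread noisily.
        try:
            agent.release_clients()
        except Exception:
            logger.exception("Failed to release clients for evicted agent")

    def _evict_cached_agent(self, session_key: str) -> None:
        # Pop under the lock, but do the slow cleanup outside it.
        with self._agent_cache_lock:
            agent = self._agent_cache.pop(session_key, None)
        if agent is None:
            return
        # The daemon thread keeps eviction non-blocking for the caller and
        # does not hold up interpreter shutdown.
        threading.Thread(
            target=self._release_evicted_agent_soft,
            args=(agent,),
            name="agent-release",
            daemon=True,
        ).start()
```

Capturing the popped agent before leaving the lock is the key difference from the current code, which discards the return value of pop() and so has nothing left to release.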
Related: Discord token failure causes reconnect storms
Location: gateway/platforms/discord.py
The Discord adapter repeatedly attempts to reconnect with an invalid or expired token (LoginFailure: Improper token has been passed), which produces "Unclosed client session" aiohttp errors and fills the logs. This is a separate but contributing issue: the reconnect loop can interfere with gateway stability during restarts. Supplying a valid Discord token will stop the reconnect storms.
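Independently of fixing the token, a reconnect loop can avoid this kind of storm by treating authentication failures as non-retryable. The sketch below is a generic illustration, not the adapter's code: `connect_with_backoff` is a hypothetical helper, and the local `LoginFailure` class is a stand-in for the exception raised by the Discord library so the example runs without it.

```python
import asyncio
from typing import Awaitable, Callable


class LoginFailure(Exception):
    """Stand-in for the Discord library's login failure exception."""


async def connect_with_backoff(
    connect: Callable[[], Awaitable[None]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> bool:
    """Retry transient connection errors; stop at once on a bad token."""
    for attempt in range(max_attempts):
        try:
            await connect()
            return True
        except LoginFailure:
            # An invalid token will not fix itself, so retrying only
            # spams the logs and leaks half-open sessions.
            return False
        except (ConnectionError, asyncio.TimeoutError):
            # Transient failure: wait with exponential backoff, then retry.
            await asyncio.sleep(base_delay * (2 ** attempt))
    return False
```

With this shape, an invalid token produces a single failed attempt and one log entry instead of a loop.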
Status
Fixes have been applied to the running instance. The PID file race fix is confirmed working: the gateway restarted successfully after the previous instance was killed with SIGKILL.
Files affected: gateway/run.py, gateway/platforms/discord.py