File descriptor leak in api_server platform: ResponseStore SQLite connections not closed on retry #36111

@jscoltock

Bug Description

The api_server platform accumulates file descriptors (FDs) over time due to SQLite WAL connections in ResponseStore not being properly closed during platform retry cycles.

Environment

  • OS: macOS 26.4.1 (Mac Mini)
  • Installation: Homebrew
  • Hermes version: (latest via Homebrew)
  • Gateway PID: 42506, uptime ~12 hours
  • FD limit: 65,535 (raised from Mac default 256)

Reproduction

  1. Enable api_server platform via API_SERVER_ENABLED=true in ~/.hermes/.env (with no API_SERVER_KEY set)
  2. Observe FD count: lsof -p $(pgrep -f 'hermes_cli.main gateway') | grep response_store.db | wc -l
  3. After 12 hours, the single gateway process (PID 42506) holds 122+ FDs pointing to response_store.db
  4. At 3 FDs per SQLite WAL connection (main db + WAL + SHM), that is roughly 41 connection sets (122 / 3 ≈ 40.7)
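
The arithmetic in step 4 can be scripted. This is only an illustrative sketch: count_store_fds and estimate_connection_sets are hypothetical helpers, not part of Hermes, and the 3-FDs-per-connection figure is taken from the report rather than measured here.

```python
import math

FDS_PER_WAL_CONNECTION = 3  # main db + -wal + -shm, as stated in the report


def count_store_fds(lsof_output: str, db_name: str = "response_store.db") -> int:
    """Count lsof lines mentioning the database (also matches -wal and -shm)."""
    return sum(1 for line in lsof_output.splitlines() if db_name in line)


def estimate_connection_sets(fd_count: int) -> int:
    """Round up: a partially opened set still implies one live connection."""
    return math.ceil(fd_count / FDS_PER_WAL_CONNECTION)
```

Feeding it the text output of lsof -p <pid> gives the same number as the grep | wc -l pipeline in step 2, and estimate_connection_sets(122) yields the 41 sets quoted above.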

Root Cause

Three contributing factors:

1. api_server auto-enabled without API_SERVER_KEY

In gateway/config.py line 1486:

if api_server_enabled or api_server_key:
    config.platforms[Platform.API_SERVER] = PlatformConfig()

The platform is instantiated even when only API_SERVER_ENABLED=true is set, without a valid API_SERVER_KEY. The HTTP server refuses to start (Refusing to start: API_SERVER_KEY is required), but the adapter is still loaded into the gateway.

2. Connected check always returns True

In gateway/config.py line 425:

_PLATFORM_CONNECTED_CHECKERS = {
    Platform.API_SERVER: lambda cfg: True,  # always returns True
    ...
}

The api_server is always reported as "connected" regardless of whether it is actually running. This is misleading and may prevent proper retry/recovery logic.

3. ResponseStore opened at init, never closed

In gateway/platforms/api_server.py line 706:

class APIServerAdapter(BasePlatformAdapter):
    def __init__(self, config: PlatformConfig):
        ...
        self._response_store = ResponseStore()  # line 706

ResponseStore.__init__ opens a SQLite connection in WAL mode (sqlite3.connect(..., check_same_thread=False) + apply_wal_with_fallback). This happens at adapter __init__ time, not when the HTTP server starts. The connection is never explicitly closed: ResponseStore defines no close() method, and APIServerAdapter has no teardown logic for the store.

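On each platform reconnect cycle, and on any gateway restart that reuses the same process, a new ResponseStore instance may be created while the old ones are neither closed nor garbage-collected, so SQLite WAL file handles accumulate. A full process restart would release every descriptor, so the growth observed within PID 42506 points to in-process cycles.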
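
The accumulation described above can be reproduced outside Hermes. The sketch below is an assumption-laden stand-in: LeakyStore is a hypothetical class that mimics "open in __init__, no close()", and it uses a plain PRAGMA journal_mode=WAL instead of apply_wal_with_fallback, whose behavior is not shown in this report. The FD count is read from /proc/self/fd or /dev/fd, so it only works on Linux or macOS.

```python
import os
import sqlite3
import tempfile


def open_fd_count() -> int:
    """Count this process's open file descriptors (Linux or macOS)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise RuntimeError("no fd directory available on this platform")


class LeakyStore:
    """Hypothetical stand-in for ResponseStore: opens in __init__, no close()."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS responses (id TEXT PRIMARY KEY)"
        )
        self._conn.commit()


def simulate_retries(cycles: int) -> tuple[int, int, int]:
    """Return FD counts: (baseline, after leaking, after closing)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "response_store.db")
        baseline = open_fd_count()
        # Holding references mimics old stores that are never collected.
        stores = [LeakyStore(db_path) for _ in range(cycles)]
        leaked = open_fd_count()
        for store in stores:
            store._conn.close()  # what a close() method would do
        closed = open_fd_count()
    return baseline, leaked, closed
```

Calling simulate_retries(5) and comparing the three numbers shows the count rising while the stores are held and dropping again once each connection is closed.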

Impact

  • Gateway hits OSError: [Errno 24] Too many open files after ~17–24 hours of uptime
  • Cron jobs fail silently when the FD limit is reached (scheduler can't open files)
  • Kanban dispatcher fails (kanban_db.py fails first at line 1111)
  • Gateway becomes unresponsive and requires manual restart
  • 10+ unexpected restarts observed in one month on this setup

Proposed Fix

Fix 1: Require API_SERVER_KEY for platform to be loaded

# config.py line 1486 — change OR to AND
if api_server_enabled and api_server_key:
    config.platforms[Platform.API_SERVER] = PlatformConfig()
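
To make the effect of Fix 1 explicit, here is a small truth-table sketch. The function names are hypothetical and only model the condition, not the real config loader. One side effect worth weighing: under the proposed condition, setting API_SERVER_KEY alone no longer loads the platform.

```python
def loads_current(enabled: bool, key: str) -> bool:
    """Current condition: either setting alone loads the platform."""
    return bool(enabled or key)


def loads_proposed(enabled: bool, key: str) -> bool:
    """Proposed condition: both the flag and a non-empty key are required."""
    return bool(enabled and key)


def truth_table() -> list[tuple[bool, str, bool, bool]]:
    """Rows of (enabled, key, current, proposed) for all four combinations."""
    rows = []
    for enabled in (False, True):
        for key in ("", "secret"):
            rows.append(
                (enabled, key, loads_current(enabled, key), loads_proposed(enabled, key))
            )
    return rows
```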

Fix 2: Fix the connected checker to validate key presence

_PLATFORM_CONNECTED_CHECKERS = {
    Platform.API_SERVER: lambda cfg: bool(cfg.extra.get("key")) if cfg else False,
    ...
}
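
A quick way to sanity-check the Fix 2 lambda is to run it against a stand-in config. FakePlatformConfig below is hypothetical; it only assumes, as the fix does, that the key lives under extra["key"], which should be confirmed against the real PlatformConfig.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakePlatformConfig:
    """Hypothetical stand-in for PlatformConfig with only an `extra` dict."""

    extra: dict[str, Any] = field(default_factory=dict)


def api_server_connected(cfg: Optional[FakePlatformConfig]) -> bool:
    """Same expression as the proposed checker, written as a named function."""
    return bool(cfg.extra.get("key")) if cfg else False
```

With this stand-in, a missing config, an empty extra dict, and an empty-string key all report as not connected; only a non-empty key reports True.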

Fix 3: Add close() to ResponseStore and call it on adapter teardown

# api_server.py — ResponseStore
def close(self):
    if self._conn:
        self._conn.close()
        self._conn = None

# api_server.py — APIServerAdapter
def stop(self):
    if self._response_store:
        self._response_store.close()
        self._response_store = None
    ...

Alternatively, make ResponseStore a process-wide singleton so repeated adapter instantiation does not create new SQLite connections.
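
The singleton alternative could look roughly like the following. This is a sketch under assumptions, not the Hermes implementation: ClosableStore and get_shared_store are hypothetical names, and the default in-memory path is used only to keep the example self-contained.

```python
import sqlite3
import threading
from typing import Optional


class ClosableStore:
    """Hypothetical stand-in for ResponseStore with an idempotent close()."""

    def __init__(self, db_path: str = ":memory:"):
        self._conn: Optional[sqlite3.Connection] = sqlite3.connect(
            db_path, check_same_thread=False
        )

    @property
    def closed(self) -> bool:
        return self._conn is None

    def close(self) -> None:
        # Safe to call more than once, e.g. from several teardown paths.
        if self._conn is not None:
            self._conn.close()
            self._conn = None


_store: Optional[ClosableStore] = None
_store_lock = threading.Lock()


def get_shared_store(db_path: str = ":memory:") -> ClosableStore:
    """Return one process-wide store, reopening it if it was closed."""
    global _store
    with _store_lock:
        if _store is None or _store.closed:
            _store = ClosableStore(db_path)
        return _store
```

If the store is shared this way, an adapter's stop() should probably not close it on every retry, since other adapter instances may still hold the same object; closing once at process shutdown is the simpler contract.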

Verification

# Count response_store.db FDs on gateway PID
lsof -p $(pgrep -f 'hermes_cli.main gateway') 2>/dev/null | grep response_store.db | wc -l

# Should be stable at ~3 (one set of db+wal+shm) after fix
# Before fix: grows by ~3 every gateway restart or retry cycle

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
