Skip to content

fix(gateway): close ResponseStore on disconnect and require key to load api_server platform#36180

Open
arimu1 wants to merge 1 commit into
NousResearch:mainfrom
arimu1:fix/api-server-response-store-fd-leak
Open

fix(gateway): close ResponseStore on disconnect and require key to load api_server platform#36180
arimu1 wants to merge 1 commit into
NousResearch:mainfrom
arimu1:fix/api-server-response-store-fd-leak

Conversation

@arimu1

@arimu1 arimu1 commented Jun 1, 2026

Copy link
Copy Markdown

Summary

Three targeted changes that together eliminate the SQLite WAL file descriptor leak reported in #36111:

  1. Require API_SERVER_KEY to load the platform (gateway/config.py)
  2. Fix the connected checker to validate key presence (gateway/config.py)
  3. Close ResponseStore in disconnect() (gateway/platforms/api_server.py)

Problem

After ~17–24 hours of uptime on a gateway with API_SERVER_ENABLED=true but no API_SERVER_KEY, file descriptors accumulated until OSError: [Errno 24] Too many open files crashed the gateway. Three layered root causes:

Root cause 1 — platform loaded without a key
API_SERVER_ENABLED=true without API_SERVER_KEY still passed the or condition, instantiating APIServerAdapter and opening a ResponseStore/SQLite connection, even though the HTTP server immediately refuses to start with "Refusing to start: API_SERVER_KEY is required".

Root cause 2 — connected checker always True
Platform.API_SERVER: lambda cfg: True always reported the platform as connected, hiding misconfiguration from retry/health logic.

Root cause 3 — ResponseStore never closed
ResponseStore.close() existed but was never called on adapter teardown. Each gateway retry cycle created a new ResponseStore (3 FDs: db + WAL + SHM) while old connections were not GC'd. The reporter measured 122 open FDs to response_store.db after 12 hours (~41 leaked connection sets).

Changes

File Change Lines
gateway/config.py if api_server_enabled or api_server_keyif api_server_key 1
gateway/config.py Connected checker: lambda: Truelambda cfg: bool(cfg.extra.get("key")) 1
gateway/platforms/api_server.py Call self._response_store.close() + null in disconnect() 3

Backward compatibility: Setting only API_SERVER_KEY (without API_SERVER_ENABLED) continues to work — the key alone is sufficient to load the platform, matching the existing docs and the intent of the api_server_key check already present in the block body.

Verification

# FD count before fix (grows ~3 per retry cycle):
lsof -p $(pgrep -f 'hermes_cli.main gateway') 2>/dev/null | grep response_store.db | wc -l

# After fix: stable at ~3 (one active connection set) regardless of uptime
  • tests/gateway/test_api_server_toolset.py — 32 passed
  • tests/gateway/test_config.py — 32 passed

Existing PRs

Searched open and closed PRs — no prior attempt found for this issue.

Checklist

  • Commit follows Conventional Commits (fix(gateway):)
  • Searched for existing PRs — none found
  • PR contains only changes related to this fix (2 files, 5 lines)
  • Existing gateway tests pass

AI disclosure: fix developed with Claude Sonnet 4.6 (arimu1 reviewed and approved the diff)

…ad api_server

Three changes that together eliminate the SQLite WAL file descriptor leak
reported in NousResearch#36111:

1. gateway/config.py — require API_SERVER_KEY to load the platform.
   Previously `API_SERVER_ENABLED=true` without a key caused the adapter
   to be instantiated (opening a ResponseStore/SQLite connection) even
   though the HTTP server would immediately refuse to start. Changing the
   condition from `or` to `if api_server_key` prevents the spurious load.
   Setting only API_SERVER_KEY (without the explicit flag) still works —
   the key alone is sufficient intent to enable the platform.

2. gateway/config.py — fix _PLATFORM_CONNECTED_CHECKERS for API_SERVER.
   The checker was a no-op `lambda: True`; it now returns True only when
   the platform config has a `key` stored, matching actual readiness.

3. gateway/platforms/api_server.py — call response_store.close() in
   disconnect(). ResponseStore.close() already existed but was never
   called on adapter teardown, leaving the SQLite connection (+ WAL + SHM
   = 3 FDs) open. On each gateway retry cycle a new adapter/store was
   created while the old FDs were not GC'd, accumulating ~3 FDs per cycle
   until EMFILE after ~17–24 hours uptime.

Fixes NousResearch#36111

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels Jun 1, 2026

@mxnstrexgl mxnstrexgl left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — automated review passed. No security, quality, or test coverage issues detected.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Review

fix(gateway): close ResponseStore on disconnect and require key to load api_server platform

Excellent 3-part fix for a production FD leak. Key observations:

  • Root cause analysis: Excellent — three layered causes identified (platform loaded without key, connected checker always true, ResponseStore never closed) with clear evidence (122 open FDs after 12 hours).
  • Fix completeness: All three root causes are addressed independently:
    1. if api_server_key (not api_server_enabled or api_server_key) — prevents platform instantiation without auth
    2. lambda cfg: bool(cfg.extra.get("key")) — honest connected checker
    3. ResponseStore.close() + null in disconnect() — proper teardown
  • Backward compatibility: Setting only API_SERVER_KEY (without API_SERVER_ENABLED) continues to work, matching existing docs.
  • Tests confirmed: 32 test_api_server_toolset + 32 test_config pass.

Looks Good

  • Minimal, targeted changes (5 lines + 2 deletions)
  • Clear commit message
  • Each root cause has its own independent fix

Reviewed by Hermes Agent

@arimu1

arimu1 commented Jun 6, 2026

Copy link
Copy Markdown
Author

Friendly ping — this has two approvals and no blocking comments. Happy to rebase or make any adjustments if needed before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants