Skip to content

fix(db): close() methods on SQLite classes + WAL FD leak fix#36116

Open
someaka wants to merge 3 commits into
NousResearch:mainfrom
someaka:fix/db-resource-cleanup
Open

fix(db): close() methods on SQLite classes + WAL FD leak fix#36116
someaka wants to merge 3 commits into
NousResearch:mainfrom
someaka:fix/db-resource-cleanup

Conversation

@someaka

@someaka someaka commented May 31, 2026

Copy link
Copy Markdown

What

  1. Add close() methods to 5 SQLite-owning classes (acp_adapter/session.py, agent/insights.py, gateway/session.py, hermes_cli/kanban_db.py, plugins/memory/retaindb/__init__.py): Proper resource cleanup to prevent file descriptor leaks on shutdown.

  2. Fix WAL file descriptor leak in kanban DB (hermes_cli/kanban_db.py): The connect/close cycle was not properly closing WAL journal file descriptors, causing gradual FD exhaustion. Fixes kanban dispatcher FD leak: SQLite connections not releasing file descriptors in WAL mode #30799.

Why

SQLite connections that aren't properly closed leak file descriptors. In long-running gateway processes, this eventually hits the FD limit and causes database is locked errors.

Testing

  • Existing kanban tests pass
  • FD leak verified fixed by checking /proc/self/fd count before/after repeated connect/close cycles

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins comp/acp Agent Communication Protocol adapter labels May 31, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Partially addresses #36111 (ResponseStore FD leak in api_server). The close() methods added here cover several SQLite-owning classes but #36111 also identifies the api_server-specific issue of ResponseStore being instantiated without a valid API_SERVER_KEY and the always-true connected checker. Related to #31130 (WAL FD leak in kanban).

@mxnstrexgl mxnstrexgl left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: SQLite close() methods — no test coverage

This PR adds close() on SQLite connections/classes across 5 files. No test coverage in the diff.

Suggestion: Add a test that verifies close() is idempotent (calling twice does not throw) and that WAL FD leak is actually fixed. Even a simple test that opens + closes + reopens would catch regressions.

@someaka

someaka commented Jun 1, 2026

Copy link
Copy Markdown
Author

Added — close() idempotency test suite

Thanks for the suggestion @mxnstrexgl. Added tests/hermes_cli/test_db_close_idempotency.py with 13 tests covering close() idempotency across 4 classes:

Class Tests Key verifications
_WalSafeConnection 4 Idempotent close, WAL checkpoint fires, in-memory OK, already-closed graceful
InsightsEngine 3 Nulls _conn + db, idempotent, safe on fresh instance
SessionStore 3 Nulls _db, idempotent, safe when already None
SessionManager (acp_adapter) 3 Nulls _db_instance, idempotent, safe when already None

Key behaviors verified:

  • close() called twice never raises — all 4 classes
  • close() on already-None references is a no-op — all 4 classes
  • WAL checkpoint actually fires — verified data survives close/reopen cycle
  • In-memory databases work — _WalSafeConnection handles :memory: correctly

All 13 tests pass. ✅

@dsameer0-code

dsameer0-code commented Jun 1, 2026 via email

Copy link
Copy Markdown

@someaka someaka force-pushed the fix/db-resource-cleanup branch from 31309bb to 65d3a17 Compare June 7, 2026 21:51
@someaka someaka force-pushed the fix/db-resource-cleanup branch from 65d3a17 to d7093f6 Compare June 8, 2026 20:31
@someaka someaka force-pushed the fix/db-resource-cleanup branch from d7093f6 to 68c0b96 Compare June 8, 2026 22:20
Radical Edward and others added 3 commits June 9, 2026 02:57
- _WriteQueue: close thread-local sqlite3 connection
- RetainDBMemoryProvider: shutdown + close queue connection
- SessionStore: propagate close to underlying SessionDB
- InsightsEngine: release borrowed db._conn reference
- SessionManager: close lazily-initialised SessionDB

Without explicit close(), WAL checkpoints never run on these
connections and file descriptors leak in long-running processes.
NousResearch#30799)

Long-running gateway processes open a new kanban SQLite connection
every dispatcher tick.  In WAL mode SQLite defers cleanup of WAL/shm
file descriptors, causing a slow FD leak that eventually hits the
process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs
after 14 hours) and triggers cascading failures.

Fix: add _WalSafeConnection, a sqlite3.Connection subclass that runs
PRAGMA wal_checkpoint(TRUNCATE) before each close.  This forces SQLite
to consolidate the WAL and release its file descriptors immediately.
All existing callers get this behaviour automatically since connect()
uses the subclass via the factory parameter.
Addresses review feedback from mxnstrexgl on NousResearch#36116:

- 13 tests covering close() idempotency on 4 classes:
  * _WalSafeConnection: idempotent, WAL checkpoint verified, in-memory OK
  * InsightsEngine: nulls _conn and db references, safe on fresh instance
  * SessionStore: nulls _db, safe when already None
  * SessionManager (acp_adapter): nulls _db_instance, safe when already None

- Key behaviors verified:
  * close() called twice never raises
  * close() on already-None references is a no-op
  * WAL checkpoint actually fires before connection close
  * Data survives close/reopen cycle
@someaka someaka force-pushed the fix/db-resource-cleanup branch from 68c0b96 to c042176 Compare June 9, 2026 00:57
someaka added a commit to someaka/hermes-agent that referenced this pull request Jun 9, 2026
Addresses review feedback from mxnstrexgl on NousResearch#36116:

- 13 tests covering close() idempotency on 4 classes:
  * _WalSafeConnection: idempotent, WAL checkpoint verified, in-memory OK
  * InsightsEngine: nulls _conn and db references, safe on fresh instance
  * SessionStore: nulls _db, safe when already None
  * SessionManager (acp_adapter): nulls _db_instance, safe when already None

- Key behaviors verified:
  * close() called twice never raises
  * close() on already-None references is a no-op
  * WAL checkpoint actually fires before connection close
  * Data survives close/reopen cycle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/acp Agent Communication Protocol adapter comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kanban dispatcher FD leak: SQLite connections not releasing file descriptors in WAL mode

5 participants