Skip to content

fix: ResponseStore SQLite connection pooling to prevent FD leak (#36111, #37369)#37389

Open
datin-antasena wants to merge 1 commit into
NousResearch:mainfrom
datin-antasena:fix/response-store-fd-leak
Open

fix: ResponseStore SQLite connection pooling to prevent FD leak (#36111, #37369)#37389
datin-antasena wants to merge 1 commit into
NousResearch:mainfrom
datin-antasena:fix/response-store-fd-leak

Conversation

@datin-antasena

Copy link
Copy Markdown

Summary

Fixes the file descriptor leak in ResponseStore that causes the gateway to hit OSError: [Errno 24] Too many open files after ~2 days of uptime.

Credit: This issue was first identified by @jscoltock in #36111 (May 31), with excellent root cause analysis. We independently hit the same problem in production (#37369) and built on their findings to implement the fix. Their proposed approach (singleton + close()) was the right direction.

Root Cause

ResponseStore.__init__() opens a new SQLite connection every time an APIServerAdapter is created (line 706), but:

  1. No close() method was called on adapter teardown
  2. No connection reuse between instances pointing to the same DB file
  3. Each gateway retry cycle accumulated 3 more FDs (db + WAL + SHM)

After ~45 hours with 8 Telegram topics, we observed 15 FDs to response_store.db (7-8 separate connections) when 3 is the healthy baseline.

Fix

Connection pooling with reference counting:

  • _connection_pool (class-level dict) keyed by resolved DB path
  • _ResponseStoreConnection wraps sqlite3.Connection with an RLock and ref_count
  • Multiple ResponseStore() instances for the same file share ONE underlying connection
  • close() decrements ref_count; connection is only closed when last reference is released
  • __del__ as safety net
  • APIServerAdapter.disconnect() now calls store.close()
  • APIServerAdapter.connect() recreates store if adapter is reused after disconnect

Thread safety: All store operations (get, put, delete, set_conversation, get_conversation) are serialized via shared.lock (RLock).

Tests

  • test_reuses_one_connection_per_db_path — verifies connection pooling
  • test_adapter_disconnect_closes_response_store — verifies adapter lifecycle cleanup
  • All 158 existing tests pass

Verification

# Before fix (20 min after restart):
ls -l /proc/$PID/fd | grep response_store | wc -l
# 15

# After fix:
ls -l /proc/$PID/fd | grep response_store | wc -l
# 3

Closes #37369
Closes #36111

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

2 participants