fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#29544
Closed
xxxigm wants to merge 3 commits into
Closed
fix(api-call): close FD-recycling race that wrote TLS bytes into kanban.db (#29507)#29544xxxigm wants to merge 3 commits into
xxxigm wants to merge 3 commits into
Conversation
…esearch#29507) The helper used to call ``socket.shutdown(SHUT_RDWR)`` followed by ``socket.close()`` to drop CLOSE-WAIT entries immediately. On its own ``shutdown()`` is safe from any thread — it only sends FIN and breaks pending ``recv``/``send`` — but ``close()`` releases the FD integer to the kernel. When the helper runs on a stranger thread (the interrupt loop, the stale-call detector) the FD release races the owning httpx worker thread that still has the same integer cached inside the SSL BIO. The kernel then recycles that integer to the next ``open()`` call — in production, kanban dispatcher's ``kanban.db`` — and the worker's delayed TLS flush writes a 24-byte TLS application-data record on top of the SQLite header. Restrict the helper to ``shutdown(SHUT_RDWR)`` only. The owning httpx worker's own unwind will close the underlying socket via the same Python ``socket.socket`` object, which atomically swaps ``_fd`` to -1 before issuing ``close(2)`` — no FD-aliasing window. The log field ``tcp_force_closed=N`` is kept (now counts shutdowns) so existing dashboards / log parsers keep working.
…upt (NousResearch#29507) Layer-2 defense for the FD-recycling race: even with ``force_close_tcp_sockets`` reduced to shutdown-only, the followup ``client.close()`` in ``_close_openai_client`` still walks the httpx pool and closes sockets — and if called from a stranger thread (the interrupt-check loop, the stale-call detector) it has the same FD-recycling exposure that wrote a TLS record on top of ``kanban.db``. Stamp the request_client_holder with the owning thread's ident at ``_set_request_client`` time. In ``_close_request_client_once``: * Owning thread (the worker's ``finally``) → pop + ``client.close()`` via ``_close_request_openai_client``, exactly as before. * Stranger thread → ``_abort_request_openai_client`` (new): only ``shutdown(SHUT_RDWR)`` the pool sockets and log a deferred-close marker. The holder stays populated so the worker's eventual ``finally`` performs the real close from its own thread context, where the FD release races nothing. Applied symmetrically to both the non-streaming ``interruptible_api_call`` and the streaming variant — both routinely get hit by stranger-thread interrupts. The log field ``tcp_force_closed=N`` keeps its existing shape; the new abort path adds ``deferred_close=stranger_thread`` so production triage can distinguish the two close kinds.
…NousResearch#29507) Ten regressions across both prongs of the NousResearch#29507 fix, organised so each test names exactly which way the bug could come back: Prong 1 — ``force_close_tcp_sockets``: * ``shutdown_only_no_close`` is the smoking-gun assertion. If a future refactor adds back ``sock.close()`` to this helper, the FD-recycling race that wrote TLS bytes on top of ``kanban.db`` is back, and this trips. * ``uses_shut_rdwr`` pins that both halves are shut down (a half-close wouldn't unblock a worker stuck in ``recv``). * ``swallows_oserror_on_shutdown`` covers the already-shutdown case. * ``handles_multiple_pool_entries`` walks all pool connections. Prong 2 — thread-aware ``_close_request_client_once``: * ``stranger_thread_aborts_only_no_close`` simulates the asyncio_0 → Thread-1616 interrupt path: stranger drives abort, holder stays populated for the worker's eventual finally. * ``owner_thread_pops_and_full_close`` is the worker-thread path: pops + full close. * ``stranger_then_owner_close_sequence_runs_full_close_exactly_once`` replays the reporter's exact timeline at object level: abort runs once, full close runs once, holder ends empty. Agent surface: * ``_abort_request_openai_client_does_not_call_client_close`` pins that the new entrypoint shuts sockets and emits the ``deferred_close=stranger_thread`` marker but never calls ``client.close()``. * ``_abort_request_openai_client_null_client_is_noop`` defensive. End-to-end: * ``fd_recycle_window_closed_by_shutdown_only`` reproduces the race at object level — runs the abort path from a stranger thread and asserts that no ``close()`` ever fires, so the kernel can never recycle the FD under the owner's still-active reference.
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #29507. P1 — TLS application-data bytes were being written on top of unrelated file descriptors (in production:
kanban.db's SQLite header), causing silent persistent-state corruption and data loss on recovery.Root cause
The reporter's forensic is conclusive — the 24 bytes that overwrote bytes 5..28 of
kanban.dbare exactly the shape of a TLS 1.2 application-data record (5-byte header17 03 03 00 13+ 19-byte encrypted payload). The only producer of that byte pattern in the process was the OpenAI/httpx TLS stream from the interrupted Codex request.Mechanism:
Thread-1616, the_callthread insideinterruptible_api_call) creates an OpenAI client, gets a TCP socket with FDX, hands it down through httpx → httpcore → ssl → OpenSSL BIO. The BIO cachesXas a raw integer.asyncio_0(a stranger thread w.r.t. the worker) and calls_close_request_client_once(\"interrupt_abort\"). That walksforce_close_tcp_sockets(shutdown+close) and thenclient.close()(which again closes pool sockets).socket.close()releases the integer FD to the kernel.open(2)in the same process — the kanban dispatcher openingkanban.db— gets FDX.recvreturns), httpx flushes one last TLS application-data record via the still-live BIO, which callswrite(X, …). The bytes land inkanban.dbbecause the kernel reassignedX.This is FD aliasing across threads: the SSL layer's BIO caches the FD as an integer, so once a stranger thread releases that integer to the kernel, anything else opening a new FD wins the race against the worker's pending TLS flush.
Fix (two prongs, both required)
Prong 1 —
force_close_tcp_socketsis shutdown-onlyshutdown(SHUT_RDWR)from any thread is FD-safe — it sends FIN, breaks pendingrecv/send, but does NOT release the FD integer. The owning httpx worker's natural unwind closes the underlyingsocket.socketvia the same Python object, which atomically swaps_fd→-1before issuingclose(2)— so the FD release happens from exactly one thread that nobody else races.Prong 2 —
_close_request_client_onceis thread-awareEven with prong 1, the followup
client.close()in_close_openai_clientwalks the httpx pool and closes sockets from whatever thread invoked it. We stamp the per-request client holder with the owning thread ident at_set_request_clienttime and:finally) → pop holder + fullclient.close()as before._abort_request_openai_client:shutdownsockets only, logdeferred_close=stranger_thread, do not touch the holder. The worker'sfinallylater sees the client still there and closes it from its own thread context — where FD release races nothing.Applied symmetrically to both the non-streaming
interruptible_api_calland the streaming variant.Prong 3 (defense-in-depth) — already exists
hermes_cli.kanban_db._validate_sqlite_headeralready detects this exact TLS-shaped corruption at open time and refuses to operate on the file (that's thekanban dispatcher: ... is not a valid SQLite databaseline in the reporter's log). Kept intact — it's the reactive layer; this PR is the proactive layer that stops the bytes from being written in the first place.Why three commits?
force_close_tcp_sockets(dropsock.close(),shutdownonly) with a long docstring naming the failure mode by issue number.interruptible_api_callvariants, plus the new_abort_request_openai_clientagent method that does shutdown-only with a distinguishable log line.tests/run_agent/test_tls_fd_recycle_corruption.py, plus the existingtest_force_close_tcp_sockets_descends_httpcore_1_connection_wrapperupdated to assertclose_calls == 0(was1).Test plan
pytest tests/run_agent/test_tls_fd_recycle_corruption.py→ 10 new tests pass.pytest tests/run_agent/test_streaming.py tests/run_agent/test_stream_interrupt_retry.py tests/run_agent/test_create_openai_client_reuse.py tests/run_agent/test_primary_runtime_restore.py tests/hermes_cli/test_kanban_db.py→ 247 existing tests still pass.OpenAI client aborted (... deferred_close=stranger_thread)log lines on/stop-interrupt scenarios — the absence of subsequentkanban dispatcher: ... is not a valid SQLite databaseerrors confirms the race is closed.Observability changes (heads-up for dashboards)
The interrupt-abort path now emits
OpenAI client aborted (..., tcp_force_closed=N, deferred_close=stranger_thread)instead ofOpenAI client closed (..., tcp_force_closed=N). Thetcp_force_closed=Nfield shape is preserved so existing parsers keep working; the newdeferred_close=stranger_threadmarker is purely additive.